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© Apep I/O programmable router* 



© A partial array processor tor massively parauel applications 13 tormed wKh low power CMOS with ORAM 
prooessfeig while mcorporating prooasdng otemente on e single chip. Eight processor* on a single chip have 

alhe* own associated processing element, significant memory, and I/O end are tncerconriected with a hypercube 
fc&sed, out modified, topology. The** node* ere then interconnected, either oy a hypercuue, modified hyper- 

8 cube, or ring, or ring wHhin ring network topology. Conventional microprocessor MMPs consume pins and time 
going to memory. The new architecture merges processor and memory with multiple PMEe {eight 1* ttt 
IN processors wah 32K and VO) In DRAM and has no memory access delays and uses ail the pins tor ntfv/orMng, 
O The chip can be a single node of a fine-grained parallel processor Each chip win have eight 16 bit processors, 
IN each processor providing 5 MlPa performance* I/O has three internal ports and one external port shared by the 
■fl plural processors on As chip. Significant software flexlbifity is provided to enable quick implementation of 
O existing programs written In common languages, lite a devetopabfe and expandable technology without need to 

& develop new ptoouts, new software, or new utilities ee chip density increases end new hardware is provided for e 
chip function. The scalable chip PME hes Interned end external connection 8 for broadcast and asynchronous 
SlMD, MIMD ami SlMMD (SJMD/MIMD) wlm dynamic switching of modes. The crap can be used In systems 
which employ 32, 84 or 120000 processor* and can be used for lower, htarifiecBate and higher ranges. Local 
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and global memory Functions can ail be provided by lie clips themselves, and Hie system can connect to and 
support other globe* memories end DASD. The cMp can be used as a microprocessor accelerator. In personal 
computer applications, at a vision or avtonfca computer system, or as workstation or supercomputer. There is 
pro-am cempaSbisty tor the Ally scalable system. 
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HELP OF THE .INVENTIONS 

The invemion relates to computet and computet systems and particularly to parallel array processor* 
In accordance wnh the invention, a parallel array processor (APAP) may be incorporated on a tingle 
6 semiconductor silicon chip. This chip forms a basis for trie systems described which are capable of 
maesrvely parallel processing or complex scienSfte end business applications 

REFERENCES USED IN THE DISCUSSION OF THE INVENTIONS 

w In the detailed discussion of ihe invention, ofcsr works vwU ba referenced, including references to our 
own unpublished works which ere not Prior Art, which wfflakJtne reader in following the discussion. 

GLOSSARY OF TERMS 

;s o ALU 

ALU la the arithmetic logic unit portion of a processor. 
0 Array 

Array refers to an arrangement of elements In one or more dimenskme- An array can Include an 
ordered set of data Hems (array element) which in languages Oca Fortran are Identified by a single 

3> name, in other languages such a name of an ordered set of date tons refers io an ordered collection 
or set of data elements, aH of which have identical attributes A program array has dimensions 
specified, generally by a number or dimension attribute. The declarator of the array may also specify 
Ihe size of each dimension of the array in soma languages, m some language*, an array Is en 
arrangemeni of elements in a table. In a hardware sense, an army is a collection of structures 

£S (functional elements] which are generally Identical In a massively peraBeJ architecture. Array elements 
in data partial compiriing are elements which can be assigned operations and whan parallel can each 
Independently and in parallel execute the operations required. Generally, arrays may be thought of as 
grids of processing elements. Sections of the array may be assigned sectional data, so that sectional 
data can be moved around in a regular grid petfem. However, data can be indexed or assigned to en 

so arbitrary local ton In an array. 

0 Array Director 

An Array Director is a unit programmed as a control ter lor an array. It performs the function of a 
master controller for a grouping of functional elements arranged In an array, 
o Anay Processor 

ss There two principal types of array processors - mufiipte instruction mufti pte date (MIMD) and single 

instruction multiple data (SiMD). In a MlMD array processor, each processing element bvtn* array 
executes its own unique Instruction stream with its own oata In a SWD array processor, each 
processing Bremen* In the array is restricted to the same Instruction via a common instruction stream; 
however, the data associated with each processing element Is unique. Our preferred array processor 

<o has orner cftaracterissc*. We call it Advanced Parallel Array Processor, and use the acronym APAP. 
o Asynchronous 

Asynchronous Is without a regular time retattooship; me execution of a Junction is unpredictable with 
respect to me execution of other functions which occur without a regular or predictable time 
ralafonahip to other function executions. In control situations* a controller will address a location to 
* wtuch oornroi is passed when data is wafeng for an idle element being addressed. TOs permits 
opsrasons to remain in a sequsnos while they are out of time coincidence wfth any event 
o B0P&G0P8 

BQPS or OOPS are acronyms having the same meaning - billvons of operations per second. See 
OOPS. 

3o o Circuit Switctted/Store Forward 

These terms refer to two mechanisms for moving data packets through a network of nodes. Store 
Forward is a mechanism whereby a data packet Is received by each intermediate node, stored into its 
memory, and then forwarded on towards Us destination. Circuit Switch is a mechanism whereby an 
Intermediate node is commanded to logically connect its input port to an output port such that daft 

58 packets can pass directly through Ihe node towards their destination, without entering the intermedlote 

node's memory. 

0 CluSfer 

A duster is a station (or functional unit) which consists at a control unit fciuster controller) and the 



3 



EP 0 570 729 A2 



hardware (which may be terminals, functor units, or virtual comporwnbs) attached to it Our Cluster 
rndudee an array of PMEa sometimes caSed a Mode array. Usually a duster has 612 PMEs. 

Our Entire PME node array consists of a sat of duster each ctoter supported by a cluster 
oonSrower (OQ. 

s o Cluster controller 

A cluster controller is a device th& controls mpufrjutpui {t'O) operaaons for mora than one devtoe or 
functional unit cannactad to It A duswr corrardfer la usually corwrcllad by a program stored ami 
executed In fro unit as « was m the bm 3901 Finance Communication Controller, but ft can bo 
entirety controlled by hardware ae n vm In the IBM 3272 Control Unit 

w o Cluster synchroniser 

A cluster synchronizer la a ajnctlonaJ unit which maneges the operations of ell or pert of a cluster to 
maintain synchronous operation of the elements so tfiat the Junctional units maintem a particular time 
relationship with the execution of a program. 
0 Controller 

js A controOer is a device fat (tracts the transmission of data and instructions over foe links of an 

Interconnection network; Its operation rs controled by a program executed by a processor to which 
the controOer & connected or by a program executed within the device, 
o CMOS 

CMOS Is an acronym for Complementary MetaJ-Oxtde Semiconductor technology. It is commonly 
aj used to manufacture dynamic random eoceaa memories <DRAMs). nmos is another tecrwotooy used 
to manufacture DRAMS, We prefer CMOS but the technology used to manufacture the APAP is not 
intended to Imft the scope of the semiconductor technology which is employed, 
o Dotdng 

Dotting refers to the joining of three or mora leads by physically connecting them togefter. Most 
a? backpanei busses share this connection approach. The term relates to On DOTS of times past but is 
used here to identity multiple data sources that can be combined onto a bus by a vary simple 
protocol. 

Our I/O ztppet concept can be used to Implement the concept that the port into a node could be 
driven by ma port out of a node or by data corning from the system boa. Conversely, data being put 
so out of a node would be avai able to bom the input to another node and to the system boa. Note mat 

outpumng data to both the system bus and another node is not done sirmi&aneously but in different 
cyctas. 

Dotting Is used In the H-OOT discussions where Two-ported PEs or PMEs or Pickets can be used 
In arrays of various organizations by taking advantage of dotting. Several topologies are discussed 
as indudmg 20 and 3D Meshes, Base 2 N-cube, Sparse Bass 4 N-cube, and Sparse Base 8 N-cube. 

0 DRAM 

DRAM is an acronym tor dynamic random access memory, the common storage used by computers 
for main memory* However, the term DRAM can be applied to use as a cache or as a rrwrnory which 
la not the main memory- 

«* 0 aQATING-POINT 

A floating-point number la expressed in two parts. There is a fixed point or fraction pari, and an 
exponent part to some assumed radix or Base. The exponent Indicates the actual placement off the 
decimal point. In the typical floatingpoint representation a real number 0.0001234 is represented as 
0.1234-0, where 0.1234 Is the fixed-point part and O is the exponent In this example, fee floating- 

* point radix or base la 10, where 10 represents the impftcrt fixed posHrve integer base* greater than 
unity, tot is raised Id the power explicitly denoted by the exponent m the flosti rig-point representation 
or represented by the characteristic in the floating- point representation and Stan multiplied by the 
fixeovpoirit part to determine the real number represented* Numeric literals can be expiessed in 
floating-point notation as well as real numbers. 

so o FLOPS 

This terms refers to fioath^g-poim InaiructJone par second, fioatino-POKtf operations include ADD. 
SUB, MPY, OiV and often many others. Ftoating-point instructions per second paramerBr is often 
calculated using the add or multiply instructions and. in general, may be considered to have a 50V50 
mix. An operation includes the generation of exponent, fraction and any required fteaon normalize 
ss tton. We could address 32 or 4£Kxt fasting-point formats (or longer but we have not counted them in 

the mix.) A floating-point operation when implemented wf*S fixed point instructions {normal or RISC) 
requires muftipfe instructions. Some use a 10 to 1 ratio in figuring performance tvhil* some specific 
studies have shown a ratio of 855 more appropriate to use. Various architectures wiN have tfftereni 
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ratios, 
o Fijncaonal unit 

A functional unit is an enffly of hardware, software*, or both, capable of accomplishing a purpose 
o Gbytes 

5 Gbytes reters to a ciPion bytes. Gbyte&s would be a union bytes par second, 
o GIGAFLOPS 

floatlng-poln? Instructions per second, 
o QOPSand PETAOP9 

GCP8 or BOPS, have fte came meaning • Wiltons of operations per second PETAOPS means 
jo trillions of operations per second, a potential of the currant machma For our APAP machine ftsy are 
just about the same as HPs/GPs meaning bullions of Insfrucflons per second. In some machines en 
instruction can cease two or mete operations (te. both an add and multiply) but we don't do thai 
Alternatively It could take many itetnictions to do an op. For exampie we use multiple instructions to 
perform 64 bit arithmetic. In counting ops however, we did not elect Id count log ops. GQP8 may be 
*a the preferred use to describe performance, but there la no consistency tn usage that has been noted. 
One sees MlPsfflOP* then BIFs/BOP* and MeoaRC«^oaFLOPS/T«raFLOPSrfe^Fiops. 
o ISA 

ISA means the Instruction $et Architecture, 
o Link 

to A link is an element which may be physical or logical. A physical link la the physical connection lor 
joining elements Of untts> while in computer programming a Bnk ts en instruction or address that 
passes control and parameters between separate portions or the program, m muliisystemd a (ink is 
the connecfion between two systems which may be specified by program code Identify Nng the Dnk 
which may be identified by a real or virtual address. Thus generally a link includes ihe physical 

ss medium, any protocol, and associated devices and programming; it is both logical and physical. 
0 M FLOPS 

M FLOPS mssra (107*6 floating-point instructions per second, 
o MIMD 

mwd is used to refer to a processor array architecture wherein each processor m the array has its 
w own instruction stream, thus Multiple Instruction stream, to execute Multiple Data streams located one 
per processing element 
o Module 

A module is a program una that is discrete and identifiabls or a functional unit of hardware designed 
tor use with other components. Also, a collection of PCs contained in a single electronic clip is catted 
ss a module, 
o Nods 

Generally, a node is me junction al links. In a generic array ol PEe« one PE can be a nods. A node 
can also contain a collection of pes called a module, m accordance wish cut invemion a node is 
formed of an array of PMEs, and we refer to the set of PMEs as a node. Preferably a node is Q PMEs. 
<o o Node array 

A collection of modules made up of PMEs is sometimes referred to as a node array, is an array of 
nodes made up of modules. A node array is usually more than a few PMEs, but the term 
encompasses a plurality . 
o PDE 

<8 A PDE Is a partial differential equation, 

o PDE relaxation solution process 

POE relaxation solution process & a vary to solve a POE (partial differential equation}. Solving PDEe 
uses most of the 3uper computing compute power in the known universe and can therefore be a good 
example of the relaxation process. There are many ways to sofoe the pde equation and more than 

so one of the numerical methods includes the relaxation process. For example, it a PDE is solved by 
finite element methods relaxation consumes the bulk of the computing lime. Consider an example 
from the world of heat transfer. Given hot gas inside a chimney and a cold wind outside, how wffl the 
temperature gradient within the chimney bricks develop? By considering ihe bricks as tiny segments 
and writing an equation thai says how heal flows between segments as a function of temperature 

ss differences then the heat transfer PDE has been converted Into a finite element problem if we then 
say all elements except tttoss on the inside and outside are at room temperature while the boundary 
segments are at ihe hot gas and cold wind temperature, we have set up me problem to begin 
relaxation. The computer program then models time by updating the temperature variable in each 
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segment based upon the amount of heat the* flows into w otit of tr» s 

processing el ihe segjnenis In the modal before die eel or temperature variables acroee the chimney 
rate** 10 represent actual temperature attribution that would occur in ft* physical ttimney. B the 
objective waa to model gas cooling in me oNmney than toe elements would haw to extend la gas 

s equations, end the boundary conoittora on trie Inside would be linked to another fWe element model, 

and ft* process oonfriuee. Note that the had flew is dapendam upon the tamperaure difference 
between the s^rnem and lis neighbors- k thus uses ihe to&r-PB corniwmicalon paths to dtetfbute 
the temperature variables ft Is UHs near ri^ghbcr communication pattern or characteristic that makes 
PDE relaaon very applicable to peraJei compute 

jo o PICKET 

TWs is the element m an array of elements making up an array processor, it consists of: data flow 
(ALU REGS), memory, control, and the portion of Ihe communkation mcaix associated with tTte 
element The unit refers to a 1/Wh ol an array processor made up of paraJtel processor and memory 
element* wft their control and portion of array Iri^comrnumcetlon mechantem. A picket ^ a form 

is of processor memory etemerrt or PME. Our PME chip dsarji procaaaor logic can implement the 

picket logic described in rallied application* or have the logic for Ihe array of processors formed as a 
node. The term PICKET is similar to the corwnoniy used array term PE for processing element, and 
la m element of the processing array preferably comprised of a combined processmg dement and 
toed memory for processing bit paralel bytes or taformaHon In a dock cycle. The preferred 

* smboKSment consistmQ of a byte wide data a»» processor, 32k byiss or more of memory. primiSve 
controls and ties to commurtatfkjns with other pickets. 

The term "picket" comes from Tom Sawyer and his white fence, although it will also be 
understood forefcmany that a military picket line analogy m quite weil. 
o Picket Chip 

29 A picket chip contains a pluraBty of pickets on a single silicon chip, 
o Picket Processor system (or Subsystem) 

A picket processor Is a total system consisting of an array ol pickets, a communicaiion network, an 

VO system, and a SIMQ oontroler consisting of a microprocessor, a canned routine processor, and a 

mjcro-controter that runs tie array. 
an o Picket Archtedure 

The Picket Architecture is the preferred embodiment for the SMD architecture writ* features thai 

acconmwdete several diverse kmcta of problems tacfciding: 

- set associative processing 

- parallel numerically tensive processing 

39 * physical array processing similar to images 

o Picket Array 

A picket array is a collection of pickets arranged in a geometric order, a regular away, 
o PME or processor memory element 

PME is used tor a processor memory element We use the term PME to refer t> a single processor. 

<*> memory end W capable system eJement or unit that terms one of our parallel array processors. A 
processor memory element m a term which encompasses a picket A processor memory etemerrt Is 
1/ntb of a processor array which comprises a processor, Us associated memory, control interface, and 
a person of an array comrnunicatton network mechanism. This element can have a processor memory 
element with a connecfirfty of a regular array, as in a picket processor, or as part of a subarray. as in 

<* me murfrprocessor memory etemen* node *e have descrbed. 

° Rooting is iha assignment of a physical path by which a message m reach Its deeb'neiion. Routing 
assignments have a source or origin end a destination. These elements or addresses nave a 
temporary relationship or affinity. Often, rnessoge routing * based upon e key which ia obtained by 
so reference to a table of assignments. In a network, a destination is any station or network addressable 

untt addressed as the desk'nanon of irformafton transmitted by a path control address that identities 
the Ink. The destination field identities the destination with a message header destination code. 

0 3MD 

A processor array architecture wherein ai processors in the array are commanded from a Sinoje 
as InstfuctJon stream to execute Multiple Data streams located one per processing element. 

0 SMDMIMD or SlMD/MIMD 

SMDMIMD or SfttD/MMD is a term referring to a machine that has a dual function mat can switch 
from MIMD to &MD tor a period of time to handle some complex Instruction, and hue has two 



6 



EP 0 570 729 A2 



modes. The Thinking Machines, Inc. Connection Machine model CM-2 when placed as a front end or 
bach end of a MIMD machine permitted programmer* to operate different modes for execution of 
different parts of a fjrobfem, referred to sometimes e dual modes. These rnecftinee have existed since 
flaw and have employed a bua that mtwccrmects the master CPU with other proceesors; The master 
9 control processor would hate ihe capability of Interrupting the processing of other CPUs* Hie other 

CPU* could run independent program code. During an interruption, some provision must be made for 
checkpointing (ctoskig and saving current status of the controlled processors). 
0 SJMIMD 

SI Ml MO ia a processor array erchtecture wherein an processors (n the array are cornrnandeo from a 
70 Single Instruction stream, to execute Multiple Data streams boated one per processing element 

Wffltin this construct data dependent operations within each picket that mimic instruction execution 

are controlled by the a MO instruction stream. 

This is a SSngte Instruction Stream machine with the afctfly to sequence Multiple Instruction 

streams forte per Picket) using the SlMD instruction stream and opeiete on Multiple Data Streams 
;s (one per Picket). SI MIMD can be executed by a processor memory element system. 

SSD 

SISD is an acronym tar Single Instruction Single Data 
so o Swapping 

Swapping interchanges the data content of a swrege area wim that of another area of storage, 
o Synchronous Operation 

Synchronous operation in a MIMD machine is a mode of operation In which each action is related to 
an event t usuafly e clock}: it can be a specified event that occurs regularly in a program sequence. An 
so operation la dispatched to a number of PEs who then go off to indepertdently perform the function. 
Control is not returned to the controller urrtfl the operaSco is ccmpi&ied. rf the request is to an array of 
functions! unite, the request is generated by a comroffer to elements in the array which must complete 
their operation before control is returned to the controller, 
o TERAFLOPS 

30 TERAFLOPS means (fort 2 ftoating-pomi Instructions per second, 

o VLSI 

VLSI is an acronym lor very large scale integration {as applied to integrated drcuite). 
o Zipper 

A zipper is a new function provided. If allows for links to be made from devices which are external to 
3s ihe normal brorcortnection of an array configuration. 

BACKGROUND OF THE INVEMTlON 

\n the never ending quest tor raster computers, engineers are DnWng hundreds, and even thousands of 
io low cost mteroproceesore togesher in parallel to create super ^ per computer 8 that divide in order to conquer 
complex problems that stump today's machlnss, Such machines ere called massively parallel- We have 
created a new way to create massively parallel systems. The many irnprovemenjs which we have made 
should be considered against the background of many works of others. 

Multiple computers operating in parallel have existed lor decades. Early parallel machines included the 
* W-iAC which was started In the 1960s, iluac iv was bum in the 19703. other multiple processors include 
(see a partial summary in US. Patent 4,075,8*4 issued December 4, 1090 to Xu et at) the Cedar. Sigma- 1, 
the Butterfly and the Monarch, the Intel ipso. The Connection Machines, the Cattech COSMIC, Ihe N Cube, 
IBM's RP3> IBM's OFH , the NYU Ultra Computet, fre Intel Delta and Touchstone. 

Large multiple processors beginning with ILUAC have been considered supercomputers, Supercom- 
so puters wllh greatest commercial success have been based upon multiple vector processors, represented by 
the Cray Research Y-MP systems, the IBM 3080. and other manufacturer's mactirnes rncfuOng those of 
Amdahl, Hitachi, Fujitsu, and NEC- 

Massivery Parallel Processors (MPPs) are now thought of as capable ol becoming supercomputers. 
These computer systems aggregate a tarn* number of mfcmprocieseors with an imeroonneciien network 
58 and program them to operate in parallel. There have been two modes of operauon of these computers. 
Some of these machines have been mimd mode machines. 

Some of these machines have been SIMD mode machines. Perhaps me most commercially ecdaimed 
of Ihese machines has been the Connection Machines series 1 and 2 of Thinking Machines. Inc . These 
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have been esserwaMy SIMD mashines. Many of massively owaJlef rrujchtots have used rrOcroprccessora 
Interconnected in parallel to obtain fcair ecrxwrency or paraiel operators capebOty* Intel mtoropmoeseom 
ike i860 iwve been used by mtel and others. N Cub* has made such machines with Intel '388 
mioroprooessore. Ofcer macMnes haw been built with what te called the transputer" chip, tomos 

s Transputer IMS T800 b an example. The temos Transputer T800 is a 32 bit device wlih an Integral high 
speed floating point processor 

As an example of the ktod of systems that are bvW, sewral Inmos Transputer T800 chips each would 
have 32 communicatton Ink inputs and 32 IWc outputs Each cWp would have a singe processor, a small 
amount of memory, end commumcetion links to the local memory end to an external Interface. In addition, 

» m order to buUd up ttw system commonkalion link adaptors fike IMS C011 and 0012 would be connected, 
to addition switches, like e MS 0004 would provide, say, a crossbar swteh between *e 32 ink inputs and 
» link outputs to provide point-to-poW connection between additional transputer chips. In addMton, there 
wffl be special circuitry and interface chips for transputers adapting them to be used tor a special purpose 
tailored to the reoAibrements of a spectflc device, a graphics or disk controller. The Inmos IMS M212 \s a 16 

, 5 bit processor, wHh on cHp rosmory and communlc&tion Onks. R contains hardware and logic to control disk 
dives and can be used as a programmabfe disk ccntroSer or as a general purpose interlace, Si order to use 
the concurrency parallel operations) Inmos developed a special language, Occam* far the transputer. 
Programmere haw to describe the rwtwork of transputers directly In an Occam program. 

Some of these massively parallel machines use parallel processor arrays of processor chips which are 

20 imerconnected wi* different topoiogiea. The tianspuwr provides a crossbar network wish the addition of IMS 
COO* chips- Some u*wr systems use a hypercube ccrwecttoa Others use a bus or mesh to connect the 
mkroprocessors and there associated circuitry. Some have boon mterwrmected by circuit switch proces- 
sor* that use swSchee as processor addressable networks- Qeneratty, as «Hh the 14 RtSOSGOO* which 
were interconnected last ted at Lawrence Uvermore by wiring ihe rnachines together, the processor 

^ addressable networks have been considered as o»se-pjeined multfroceesera. 

Some very large machines are being built by Intel and nCube and others to attack what ere called 
"grand chsSengee" in date processing. However, these computers fire very expensive. Recent projected 
costs are In the order of $30.000>Q0O00 to $75^00.00000 (Tera Computer) for computers whose 
development has been tended by me U.S. Government to attack the "grand challenges*. These "grand 

30 chaJenges" would Include such problems as cSrnate modeling, fluid turbulence, pollution dispersion, 
mapping of the human genome aiKl ocean cffcutefcn, quantum c^iromodynamlcs, semicorxtector and 
supercomputer modeling, combustion systems, vision and cognmon, 

Aa a footnote to our background, we should recognize one of the early massively parallel machines 
developed by IBM, In our description we have chosen to use ihe term processor memory element father 

as than "transputer 11 to describe one of the eight or more memory unite with processor end l-O cepabtfroee 
which make up the array of PMEs in a chip, or node. The referenced prior art Transputer has on a chip 
one processor, a Fortran coprocessor, and a small memory, wrih an TO irtterfece. Our processor rnemory 
etement could apply to a transputer and to the PME of the RP3 generally. However, as will bo recognized, 
our ittte chip is sigrtflcantfy operant te many respects. Our Ihtte chip has many features described later. 

40 However, we do recognize that the term PME wee *st coined ior another, now more typical. PME which 
formed the basis for the rrwssfvety parallel machine known as the RP3, The IBM Research Parallel 
Processing Prototype (RP3I was an experimental perelei processor based on a MuBpte instruction Multiple 
Data WIMD) architecture. RP3 was designed and buffi at IBM TJ. Watson research Center in cooperation 
with the New York University Uttracomputar project. This work was sponsored in part by Defense Advanced 

« Research Project Agency. RP3 was cornpdsed of 34 Processcr-Memor^ Bements (PMEs) Interconnected 
by a high speed omega network- Each PME contained a 32-bit IBM *PC scientific* microprocessor, 32*6 
cache, a 4-MB segment of Ihe system memory, and an HO port The PME I/O port hardware and software 
supported iottieRzaSon, status acquisition, as well as memory and processor communication through shared 
to support Processors <I8Ps). Each ISP supports eight processor- ntemory elements through the Extended 

so i/G adapters (EiTOs), Independent of Ihe system network. Each ISP Interfaced to the IBM S/370 channel 
and the IBM Token-Ring network aaweUaa provsjng operator monster service. Each extended VQ adapter 
attached as a device to a PME ROMP Storage Channel <R$Q and provided programmable PME 
control/status signal 1/0 vie the ETIO channel. The ETK> channel is the 32-bit bus which Interconnected the 
ISP to the eiant adapter*. The EHO channel reted on a custom Interface protocol with was supported by 

59 hardware on the EDO adapter and software on the ISP 
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Problems addresssd by our APAP machine 



TT19 machine which we have cased tie Advanced Parallel Away Processoi (APAP) is « finegrained 
parauel processor which we believe is needed to address issues of prior designs. As ittusfrated above, there 

s have been many fine-grained (end also coarse-grained) processors constructed from both point design and 
off-the-shelf processors using dedicated and shared memory and any one of me many possible Intercon- 
nection schemes. To dan these approaches have all encountered one or more design and performance 
limitations. Each "solution'" loads In a different (fraction. Each has las problems. Existing parallel machines 
are dWtcutt to program. Each is not generally adaptable to various stees of machines compatible across a 

w range of applications. Each has its design limitations caused by physical design, tntaxonnaction and 
architectural issues. 

Physical issues 

;s 8ome approaches utifza a separate chip design for each of the various functions required in a 
horizontal structure, litase approaches suffer performance Gmilafions due to chip crossing delays, 

Other approaches integrate various functions togefter vatteaSy too a single chip, these approaches 
sutler performance ftmitattans due to the physical Km* on the number of togto gales which can be 
integrated onto a producible chip. 

Interconnection Issues 

Networks which interconnect the various processing functions are important to fine-grained parallel 
processors. Processor designs with buses, meshes, and hyparcubea have all been developed. Each of 

so these networks has inherent limitations as to processing capabfcty. Buses limit both *e number of 
processors which can be physteaffy infercormecied and the network performance. Meshes lead to large- 
oefcvork diameters which limit network performance. Hyperoubes require each node to have a large number 
of Interconnection ports; the number of processors which can be interconnected Is limited by the physical 
folWbuiput pins at the node. Hypercubes are recognized as having some significant performance gams 

30 over (he prior bus and mesh structures. 

Architectural Issuer 

Processes which are suitable for fine-grained parallel processors fail into two distinct types. Processes 
as which are functionary partifonable lend to perform better on multiple instruction, mu&pte data <MIMD) 
architectures* Processes which are not functionally p&rtftonebis but have multiple data streams tend to 
perform belter on single instruction, multiple data (81 MO) architectures. For any given application, there is 
fiksty to be some number of both types of processes. System trado-offis are required to pick the 
architecture which best suits a particular application but no single solution has been satisfactory. 

40 

SUMMARY OF THE INVENTION 

We have created a new way to make massively parallel processors and other computer systems by 
creating a new "chip 11 and systems designed with our new concepts. This appCcaiion is directed to such 

<s systems. Cwnponerns described in our eptfcattoris can be combined in our systems to make new 
systems. They also can be combined wfth existing technology. 

Think, our little CMOS DRAM chip of approximately H x 14 mm can be put together much lice bricks 
are walled in a building or paved to form a brick road. Our chip provides the structure necessary to build a 
"house", a complex computer system, by connected replication. 

so Placing our development In perspective, four little chips, each one alike, each one with eight or more 
processors errtoedded In memory with an internal array capability and stfernei I/O broadcast and control 
Interface, would provide me memory and processing power of thirty-six or more complex computers, and 
they could all be placed with compact hybrid packaging into something the size of a watch, and operated 
wfth very low power, as each chip only dissipates about 2 wans, with this chip, we have created many new 

56 concepts, and tfrose that we consider our invention are described in detail in the description and claims. 
Tne systems that can be created with our oornputer system can range from small devices to massive 
machines with PETAOP potential. 
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Our little memory chip array processor we call our Advanced Paraflet Array Processor. Though smaO, n 
to cample* and powerful. A typical dueler will have many cUtps* 

Many aspects and features of invention have been described in fMe and related aopfeettofts. These 
concepts and features of invention improve and are appicabto to computer cystoma which may not employ 

s each invention. We believe our concepts and features wil be adopted and used in me next century. 

This technfcsJ descrlpflon provides en overview of our Advanced Parallel May Processor (APAP) 
representing our new memory concepts and our sflbrt In developing a scalable massively parallel processor 
(MPPi that Is simple (very smal number of unique pert numbed end he* very high performance. Our 
processor uffltees to Its preferred embodiment a VLSI chip. The chip comprises &> PME mfcroccrnputsr*. 

w V represents me maximum number of array rfmenskroaJKy, The chip further comprises a broadcast and 
control interlace (BCH and internal and SKtemeJ oommunicadon peas between pmes on me chip among 
themselves and to the off chip system environment The preferred chip has 8 PMEs (but w» also can 
provide more) and one BCK Tne 2n PMEs and BCI are considered a node. This node can rune lion in aimer 
SIMD or mimd mode, in dual SIMQ/MODE, with esyrichronous processing, and vath SIMIMD functionary. 

is Since it is scalable, finis approach provides a node which can be the man building block for scalable 
paraiei processors of varying stae. The nrricrocomputw architecture of me PME provides fcu&y distributed 
message passing tofcrconnectJcn and control features wthm each node, or chip. Each node provides 
multfcte parapet rrOcrocomputer capability at the chip level, the rrtorcprocssor or personal computer level, at 
a workstation level, el specie] appScaUon levels which may be represented by a vision and/or avtontes level, 

20 and, when fully extended, to capability at grew* levels with powerful Gigaftop performance into the 
supercomputer range. The stmpfcHy is achieved by the use of a single highly extended DRAM Chip met is 
ropScatod Mo parallel dusters, TNs Keeps the part number count down and allows scaling copefciWy to the 
cost or performance need, by varying me chip count, then the number of modules, etc 

Our epproech enables us to provide e machine with etuibutes meeting the requirements that oYtve to a 

9 pereiei solution in a series of applications. Our methods of paraflelizetlon at the sub-chip level serve to keep 
weight, volume, and recurrmg and logistic costs down. 

Because our different size systems am aJ based upon a single chip, software toots are common tor all 
size systems. Hits offers the potential of development software (running on smaller workstation machines) 
that is interchangeable among el tevefe (workstason, aerospace, and supercomputer)- That advantage 

so means programm er s can develop programs on workstations while a production program runs on a much 
larger machine. 

As a result of our v/sH belsncsd design impismerrtaoon we meet today's requMwrterrts Imposed by 
technology, performance, cost and perception and enable growth of the system into Hte future. Since our 
MPP approach starts at the chip level, our discussion starts at the chip technology description and 

09 concludes with the superaornputer appttcafion description*. 

Physical, interconnection, and architectural Issues wW a* be addressed to the machine cirectly. 
Functions w« not only be integrated into a smgis chip design, but the chip design wiH provide fufwttons 
sufficiently powertol and ftercbfe shai to chip will be effective at processing, routing, storage and three 
classes of I/O. The interconnection network win be a new version of the hypercuoe which provides minimum 

<* network diameters without the iaput'outpuT prn and wireebifty timifatiens normally associated wHh hyper- 
cubes. The trade-off between 3MD and MIMD are eliminated because the design aaows processors to 
dynamically swift* between mimd end SIMD mode. This eliminates many problems which wil be 
encountered by appKcaHon programmers of "hybrid* rnacrsnes. In eddKlcn, me design wffl alow a subset of 
the processors to be in SIMD or MIMD mode. 

45 The Advanced PaieBei Array Processor (APAP) is a fine-grained parallel processor. It consists of conwi 
and processing secsons wrsch are pe/tronabte such tSat corilgurattons suitable for supercomputers 
through personal compufJng appacaUons can be saUeliadL In most configuration* ft would attach to a host 
processor and support the off toadrng of segments of the ttosfe workload Because t*e APAP array 
processing elements are general purpose carnputere, the pertrcvter type of wwWoed off-loaded will vary 

so depending upon the capabfltfes of the host For exampto, our APAP can be a module tor an IBM 3090 
vector processor mantram*. When aBached to a mainframe with high f#rfofmanoe vector floating point 
capability the task off-loaded might be sparse to dense matrix transtormarJons. Alternatively, when attached 
to a PQ persona] computer the off-loaded task might be numerically intensive 3 dimensional graphics 
processing, 

S3 The above referenced parent USSN 07/61 J £84, filed November i$, 1690 of Keifenderfer at al„ tilted 
r Porafi^ Associative Processor System" describes the idea of totegrating computer memory and control 
logfc wfihto a stogfe chip and repeating the combination within the chip and bufldlng a processor system 
out of reptcations of the single chip. This approach which is continued and expanded here leads to a 
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system which pravidus massively parallel processing capebKiy at the cost of developing end manufacturing 
only a single chip type white enhancing performance capability by reducing the chip boundary crossings 
and fine length* 

The above referenced parent US8N 07 flu ,§£4, filed November 13, 1600 iUusfrated utiiteatton of i- 
« dimension s ! I/O structu res( essentially a linear IflO) wUh multiple &MD PMEs attached Id tfrai structure within 
a chip. TNa emtxKfimsnt elaborates these concepts to dimensions greaier men 1. Trie description which 
Mows will be In rerms of ^dimensional I/O structures with & SIMD-'MMD PMEs per chip. However, that 
can be extended to greater dlmenstonafity or more PMEs pei cflmenslori as we will describe with respect to 
FIGURES 3, 9, 10, IS and 16. 
10 Our processing elerneru Includes a ruH I^O system including both data transfer and program Interrupts. 
Our description of our preferred embodiment woi be prfrnarDy described in terms of the preferred 4- 
dimensional W sbucturea with 8 8IMD/WMD PMEs per chip, which has specia) advantages now fen our 
view. However, thai can be extended to greater dimensionality or more PMEs per dimension as described 
in our parent application. In addition, for most applications we prefer and have made inventions In areas of 
/s greater drmensiona wfth hypercube rnterronnecttons, preferably with the modified hypercube we describe. 
However, in some applications a 2-ojmeratonai mesh mtsrcoririectlan oi chips win be applicable to a task at 
hand For instance, in certain miliary computers a 2 dimensional mesh weH be suitable and cost effective. 

TWs tfactosure extends the concepts from the Irterprocessor communication to the external 
Input/Output fodfltles and describes the Interfaces and modules required for control of the processing array, 
so in summary three types of VO. buef-proceeaor, processors to/from external, and broaoV^oontroi are 
described. Massfvefy parget processing systems require alt these types of tt> bandwidth demands to be 
balanced with processor computing eapebifity. Within the array these requirements wt) be satisfied by 
replicating a IB bit (reduced) instruction set processor, augmented with very fast interrupt state swapping 
capability. That processor is referred to es the PME illustrating the preferred embodiment oi our APAP- The 
es characteristics of the PME are completely unique when compared wtft the processing elements on other 
massively parallel machine*. It permits me processing, rouSng. storage and VO to be completely distributed. 
This is not characteristic of any other design. 

in a hypercube each PME can address as its neighbor, any PME whose address differs in any single bit 
position. In a ring, any PME can address as its neighbor the two PMEs whose addressee differ t i. The 
so modified hypercube of our preferred embodiment uuTued for the APAP combines these approaches by 
building hypercubes out of rings. The intersection of rings is defined to be a none. Each node of our 
preferred system has its PME, memory and I'O, and other feature? of the node, formed in a samicunductor 
sIScon low level CMOS DRAM chip. Nodes ere constructed from multiple PMEs on each chip. Each PME 
exists in only one ring of nodes. PMEs within the node are connected by sdolUonaJ rings such that 

39 communicattons can be routed between rings within the node. This leads to the addressing structure where 
any PME can step messages toward the objective by addressing a PME in He own ring or an aolacent PME 
vtfhin the node, in essence a PME can address a PME whose address dffiers by i in one in the f n»d bit 
Held of He ring (where d is the number of PMEs in Ihe ring) or the PME with (he same address but existing 
fen an adjacent dimension. "The PME effectively appears to exist in n sets of rings, while in actualty It exists 

40 only m one reel ring and one hidden ring totally contained wahin the chip. The dimensionality for the 
modified hypercube is defined to be the value n from the previous sentence. 

We prefer to use a modified hypercube. This Is elaborated tn the part of this application describing the 
technology. Finally. PMEs within a ring ens paired such that one moves data externally clockwise along a 
ring of nodes and the oflisr moves data externally counterclockwise along tie ring of nodes, thus dedicating 

u a PME to an external port. 

In our massively parallel machine. In our preferred embodiment die interconnecaon and broadcast of 
data and instructions from one PME to another PME in the node and externally of the nods id otter nodes 
of e duster or PMEs of a massively parallel processing environment are performed by a programmable 
router, allowing, recent fguratton and virtual flexibility to the network operations. This important feature is fully 

so distributed and embedded In the PME and allows for processor communication and data transfers among 
PMEs during operations of the system in SMD end MiMO modes, as well as in the smtmmo and 
SIM (MO modes of operation. 

VWtWn ihe rings each interconnection (eg is a pomKo-point connection. Each PME has a point-to-point 
connection wah the two neighboring PMEs m its tmg and with two neighboring PMEs in two adjacent rings. 

55 Three of these point-to-point connections are internal to the node, while me fourth poinMo-pomi connection 
is to an adjacent node. 

The massively parse*! processing system uses the processing elements, wan their local memory and 
interconnect topology to conned all processors to each other Embedded within the PME is cur tony 
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dfctributed I/O programmable router. Our system also provides an eddrikxi to the system which provides itm 
dbfflty to toad and unload all the processing eternento. With our zipper we provide a method for loading and 
unSoacBng of tte array of PEs and thus enabte impferootetion of a fast I/O along an edge of the 
rings. To provide for external interface I/O any subset of the rings may be broken {un-zipped across some 

5 dkmenslon(s)) *lth the resutam broken paths connected bo the externa) Interface. The co-pending appflca- 
fion entitled "APAP K> ZFPER", filed concurrently herewith, USSN .filed May 22.1992. 

describes our 'zipper* in additional detail The 'zipper* can bo applied to only the subset of Inks required to 
support the peak external WO load* which in all convocations considered so far leads to Its being applied 
only to one or two edges of the physical design, 

10 The final type ol UO consists of date mat must be broadcast to, or gathered from all PMEs. plus data 
which is too speciafeod to fit on the standard buses. Broadcast data includes commends, programs and 
data. Gathered data is primarily status and monitor functions white diagnostic and test functions are to 
spedaized elements. Each node, in addition to the included set of PMEs, contains one Broadcast and 
Control interface (BCfl section. 

rs Consider PMEs interconnected in a rnodlned 4 dimensional Irypercube network. If each ring cortelns 16 
PMEs. (hen the system *18 have 32.766 PMEs. The nelwork clameter is 16 steps. Each PME contains In 
SW the router and recon^tjrattori SAW to support a particular outgoing port Thus* software routing 
provides the capability to reconfigure in the event of a faulty processing element or node. Inherent to a 4d, 
25 Mhz network design wtlh byte wide half duplex rings Is the provision for 410 gigabytes per second peek 

so internal uandwidm. 

The 4 dimensional hypercube leads to a particularly advanrageous package. Bghc of the PMEs 
(including date flow, memory and 10 paths and controls) an? encompassed in a single chip. Thus, a node 
will be a single chip metaling pairs of eternente along the rings. The nodes are configured together in an 8 
X 8 array lo make up a duster. The tufly populated machine b built up of on array of 0 X 8 clusters fo 

29 provide the maximum capacity of 32J68 PMEs. 

Each PME is a powerful microcomputer having significant memory and I'O functions. There is mutiibyte 
date flow wifrn a reduced instruction set (RISC) architecture Each PME has id bit internal date flow and 
eight levels of program interrupts wWh the use of working and general registers to manage date flow. There 
is a circuit ewnched and store and forward mode tor to transfer under PME software control The SIMD 

3D mode or MIMD mode Is under PME software control. The PME can execute RISC Instructions from either 
me BOI in a 81 MD mode, or from its own main memory in MIMD mode. Specific RISC instruction code 
points can be reinterpreted to perform unique functions in the SIMD mode. Each PME can implement an 
extended Instruction Set. Architecture and provide routings which perform macro level Institutions such as 
extended precision fixed point arithmetic* Boating point arithmetic, vector arljhmette, and the Bke. This 

jo permte not only complex math to be handted but image processing activities for display or image date in 
multiple dimensions (2d and 3d images) and tor multimedia applications. Tiie system can select groups of 
PMEs for a function. PMES assigned can elocate selected date and instructions for group processing The 
operations can ba externally monhorod via the BCL Each BO has a primary control input, a secondary 
control Input end a states monitor output for the node. Within a node the 2n PMEs can be connection for a 

40 binary hypercube oemmurticenon network within the chip. Commurxcaaon between PMEs is controlled by 
to bfte in PME control registers under control of PME software. This permits (he system to have a virtual 
routing capability. Each PME can step messages up or down Its own right or to Its neighboring PME In 
aimer of two adjacent rings. Each interface between PMEs Is a point-to-point connection. The 00 ports 
permit ofrchip extensions of me internal ring to adjacent nodes of the system. The system is built up of 

<a reptcetions of a node to form a node array, a cluster, and other configurations. 

To complement our system's SIMD, MMO, SlMO/MIMD and SIM WD functionary, our development we 
have provided unique operational modes. Among our SIMD/MIMD PMFa unique modes are me new 
fonctfonei features referred to as me "store and forward / circuit switch" functions- These hardware njncSons 
comptemented with the on chip communication and programmable Interna) and external I/O rcuflng provides 

so me PNE with very optimal date transferring capability. In preferred mode of operation the processor 
memory te generally the data sink tor messages and data targeted at the PME in the store and forward 
mode. Messages and data not targeted lor the PME are sent directly to the required output pore when In 
circuit switched mode. The PME software performs the selected routing path while giving the PME a 
dynamically selectable store and forward t circuit switch functionary. 

S3 Among (he advances we have provided is a fully distributed architecture for PMEs of a node. Each 
node has 2n processors* memory and tf>. Every PME will provide very flexible processing capability with 
18 bit data flow, 64K bytes of focal storage, store and forw&d/clPCuii switch logic, PME to PME 
corrtmunication. SJIVHVMIMO switching capabilities, programmable routing, and dedicated floating point 
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assist logic. The organization of awry PME ami Hs communication paths with other PMEs vtrthin the same 
chip to minimize chip crossing delays. PME functions can be independently operated by the PME and 
integrated with functions in to node, a cluster, and larger arrays. 

Our massively parallel system is made up of nodal building Mocks of muEi processor nodes, clusters of 

3 nodes, and arrays of PNEs already packaged In clusters. For control of tries© packaged systems wha 
provide a system array director union w*th the hardware controllers performs the overeD Processing 
Memory Element tPME| Array Controller functions to the massively parallel processing environment The 
Director comprises of three functional areas, the Application Interface, the Cluster Synchronizer, and 
normaly a Cluster Controller. The Array Director wiK have the overall control of the PME array, usmg trie 

w broadcast bus and our zipper connection to steer data and commands to all of the PMEs, The Array 
D tractor function a as a software system impacting witri the hardware to perform the role as the she* of the 
APAP operating system. 

Hie interconnection for our PMEs tor a massively parallel array computer StMDvMIMD processing 
memory element (PME> interconnection provides the processor to processor connection in the massively 

;s parallel processing environment Each PME uafees our fully distributed interprocessor communication 
hardware from the on-chip PME to PME connect ton. to the off-chip I/O facilities which support the chkp-to- 
chlp Inter connection Our modified topology SmUs our cluster to duster wiring while support the 
advantages of hyporcube connecttons. 

The concepts which we employ tor a PME node are related to the VLSI packaging (echntquas used for 

so the Advanced Parallel Array Processor (APAP) computer system disclosed here, which packaging features 
of our invention provide enhancements to the manufacrurktg ability of The APAP system. These techniques 
are unique m the area of massively parallel processor machines and wffl enable me machine to be 
packaged and configured in optimal subsets that can be built and tested. 

The packaging techniques take advantage of rhe eight PMEe packaged in a single chip and arranged in 

so a N-dmencional modified hypercube configuration. This chip level package or node of the array is the 
smallest building block in the APAP design. These nodes are ften packaged in an 6 x S array where the +• 
X and the +-V makes rings wiWn ihe array or duster and the + -W> and +-2 are brought out to the 
neighboring clusters. A grouping of cfcisers make up an away. The intended applications tor APAP 
computers depend upon the particular confeguraaon and host Large systems attached to rnainframes with 

30 effective vectorised Hosting point processors mlgfti address special vedorizabte probleme - such as weather 
prediction, wind tunnel simulaaon, turbulent fluid modeling and tmac element modeUng. Wiiere these 
problems frivotve sparse mar/ices, significant wot* must be done to prepare the dais tor vectorized 
arithmetic and ifcewlse to store results. That workload would be off loaded to the APAP, In rniermecBate size 
systems, the APAP might be dedicated to perfonning the graphics operations associated with visualization, 

as or with some preprocessing operation on incoming data 0^ performing optimum assignment problems in 
military sensor fusion applications)- Small systems attached to workswuens or PCs might serve as 
program rner development sftaons or might emulate a vectorized floating point processor attachment or a 
3d graphics processor. 



a BRIEF DESCRIPTION OF THE DRAWINGS 

Ra 1 shows a parallel processor processing element like those which would utilize old technology. 

FIG. 2 shows a massively parallel processor building block m accordance with our invention, 
representing our new chip design. 
* HQ. 3 illustrates on the right side the preferred cWp physical cluster layout for our preferred 
embodiment of a chip single node fine grained parallel processor. There each chip b a 
scalable parallel processor chip providing 5 MIPe performance with CMOS DRAM memory 
and logic permitting air cooled impiementaiion of massive concurrent systems. On the left 
scte of Figure 3, there is illustrated the replaced techno togy. 
30 FIG. 4 shows a computer processor funcQcnaJ block diagram In accordance with the Invention. 

FIG. 5 shows a typical Advanced Parallel Array Processor computer system catffjuretion, 

FIG 6 shows a system overview ol our fine-grained parallel processor technology In accordance 
with our invention, Illustrating, system build up using replication of the PME element which 
permits systems to be developed with 40 to 193,8*0 mips per^rmance. 
99 FIG. 7 illustrates she hardware for the processing element (PME> data Sow and local memory in 
accordance with our invention, wnoe 

FIG a illustrates PME data flow where a processor memory element is configured as a hardwired 
general purpose computer that provides about 5 MIPS fixed point processing or A IVffiops via 
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programmed control floating point operations. 
FIG. 9 shows Ihe PME to PME connection (binary hypercube) and data paths that can be taken In 

accordance wfN cm invention* while 
FIQ. 10 ilusfrates node in^connectons tot tie chip or node which has 8 PMEe, each of which 
s manages a single external port and permits distribution of me network control function end 

eliminates a functions hardware port botiteneck. 
FIG. II Is a block diagram of a scalable parallel processor chip where each PME Is a 16 bit wide 

processor 32 K words of local memory and there to I/O porting far a broadcast port 

which provtoss a ccmroWer-to-sdl Interface while external porta are bWlrectional polrrt-to-pow 
jo Interfaces permhtjng ring torus connections within the chip aid externally. 

Fia 12 shows an array director in the preferred embodiment 

FIG. 13 In part (a) fcrustrates mo system bus to or from a dusfcr array coupling enabling loading or 
unloading of the array by connecting the edges of clusters to the system bus (see FIGURE 
14K In FIGURE 13 m part (bl there is the bus toflrom the processing element portion, 
is FIGURE 13 fflustates how multiple system buses can be supported with multiple clusters. 

Each duster can support 50 to 57 Mbyte/a bandwidth. 

FIG, 14 shows a'zfppeT connection for fast I/O connection. 

FIG. id shows an 0 degree hypercube connection Illustrating a packaging technique In accordance 
wttfi our Invention applicable to an 8 degree hypercube. 

ae> FIG. 19 snows two Independent nods connection* in the bypeicube. 

FIG. 17 shows the Bitonic Sort algorithm as an example to ifostraze the advantages of the defined 

oiMDiMMD processor system, 
FIG. id Illustrate* a system block disc/am lor a host attached I&ge system w*h one application 
processor interface iHuatatsd. This lUustrauon may also be viewed whti tie understanding 

9 that our invention may be employed in stand alone systems which use multiple application 

processor interfaces. Such interface* m a FIGURE 18 configurafion wiH support 
DASO&raphics on aD or many dusters, Workstation accoteratons can eliminate the host 
application processor enferface (API) and cluster synchronizer {OS) illuseated by emulation. 
The CS fe not required in aa instances. 

30 FIG. 19 iluslrates Ihe software development environment tor our system. Programs can be prepared 
by and executed torn the ftost ^ppficafton processor. Bom program and machine debug is 
supported by the workstation based console iflustrafcd here and in FIGURE 22. doth of these 
services wtl support appflcaitona operating on a real or a simulated MM P. enabling applica- 
tions to be developed at a workstation level as wen as on a supercomputer formed of tlie 

as APAP MMP. The common software erwironment enhances programrrvabiilty and distributed 

usage. 

FIG. 20 ilustretes the programming levels which are permsted by the new systems. As caterers 
users require more or less detailed knowledge the software system Is developed to support 
this vartaoon. At the highest levsl the user does not need to know the architecture Is Indeed 
<*a an MMP, The system can be used with existing language systems for partitioning of 

programs, such as parallel Fortran. 

FIG 21 ilustretes the parallel Fortran compiler system for the MMP provided by the APAP corrigura- 
tfcns described. A sequential to parallel compiler system uses a combination of existing 
compter capacity with new data aSocation functions and enables use of a petitioning 
45 program nice FortranD. 

FIG. 22 ilustretes the worfestztion applcatfon of me APAP, where the APAP becomes a workstation 
accelerator. Note that the unit has the same physical ate as a RISC3000 Model 530, but 
this model now contains an MMP which is attached to the vnorkstptkm via a bus extension 
module iflustrsted, 

so FIG. 23 lluslraies an nppJteation far an APAP MMP module for an AWACS military or commercial 
application This is a way oi handing efficiently the classical distributed sensor fusion 
problem shown in FIGURE 23, where me observation to track matching is classically done 
wfr well know algorithms like nearest neighbor, 2 dknensionsi Bnear assignment (Munkes-). 
probabMlc data association or multiple hypothesis testing, but these can now be done in an 

S3 improved manner as iusirated by FIGURES 24 end 25, 

FIG. 24 ilustretes how the system provides the abSity to handle n-dlmensional assignment problems 
in realtime. 

FIG. 25 tlustrates processing flow for an rKirnensional assignment problem utilising an APAP. 
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RG. 26 illustrates the expansion una provided by the system enclosure described showing how a 
ufttt can provide 424 Mffopfi or 6120 MIPS using only 8 id 10 extended SEM-E modules, 
providing me perforrnance comparable to that of spedalzed signal processor module in only 
.6 cubic test This system can become a SIMD massive machine* with 1024- parallel 
s processors performing two Ml Bon operations per second (GOPS) and can grow by adding 

1024 additional processors and 32MB addrBonal «icr£g& 
RO. 27 illustrates ihe APAP packaging tor a supercomputer. Here is a large system of comparable 
performance but much smeller footprint then other systems. It can be built by replicating the 
APAP cluster wtoln an enclosure flf<a tnoes used for smaller machines, 
ro We have provided, as part of the description, Tables dustraUng Ihe hardwired instructions for a PMG, in 
wWch Table 1 illustrates FtoeoVptfr* arithmetic instructions: Table 2 illustrates storage to storage tostruc- 
aons; Table 3 illustrates logical insfrucHons; Table 4 illustrates shift Instructions: Tab* 5 illustrates branch 
instructions: Table 6 illustrates the status switching instructions: and Table 7 illustrates the input/output 
Instructions. 

!& (Note: For corrveruence of ittuetnSon m ttis formal patent drawings, FIGURES may be separated in parte 
and as a contention we place Ihe top of the FIGURE as the first sheet with subsequent sheets proceed! rig 
down and across when viewing the FIGURE* in fie event that multiple sheets are used) 

Our detailed description follows with parte explaining the preferred embodiments of our invention 
provided by way of example. 

so 

DETAILED DESCRIPTION OF THE INVENTION 

Turning now to our invention in greater detail, it wlH be seen from FIGURE I, which illustrates the 
existing technology' level, illustrated by (he transputer T000 chip, and representing similar chips for such 

£* machines as the illustrated by the Touchstone Delta (iaeo% N Cube ('386). and others, When FIGURE 1 te 
compared with ihe developments here, it nil) be seen mat not only can systems tike the prior systems be 
substantially improved by employing our invention, but also new powerful systems can be created, as we 
wfll describe. FIGURE i*s conventional modem microprocessor technology consumes pins and memory. 
Bandwkft is limited and inter-chip communication drags the system down, 

3D Tha new technology leapfrog represented by FIGURE 2 merges processors, memory, M> into multiple 
PMEs (eight or more 16 bit processors each of which has no memory access delays and uses all *e pins 
tor networking) formed on a single low power CMOS ORAM chip. The system can make use of ideas of our 
prior referenced ffiecrosures as wel as Invention separately described In the applications filed cortcurrerrtly 
herewith and applicable to the system we describe here. Thus, for this purpose they are incorporated herein 

35 by reference. Our concepts of grouping, autonomy, transparency, zipper interaction, asynchronous SIMD, 
Suvumd or Simd/mimd. can aU be employed with (he new technology, even rnough to lesser advantage 
they can be enjoyed in the systems of the prior iscfcnetogy and in combination with our own prior muitrpts 
picket processor. 

Our picket system can employ the present processor. Our basic concept is frat vre have now provided 

40 a repticabte brick, a new basic buftfing block for systems van our new memory processor, a memory unit 
having embedded processors, router and NO, This bask buftfrvg block is scalable. The basic system which 
we have implemented employs a 4 Meg. CMOS DRAM. It is expandable to be used in larger memory 
configurations, with 16Mbit DRAMS, and d4Mbit chips by expansion. Each processor is a oats array, Wim 
denser deposition, many more processors, at higher clock speeds, can be placed on the same chip, and 

<$ using gates and edrjfftonai memory win expand the performance of each PME Scaling a single part type 
provides a system framwork and axcrtiecture which can have a performance wel into the PETA0P range. 

FIGURE 2 illustrates the memory processor which we can the PME or processor memory element in 
accordance w&h our preferred embodiment The processor has eight or more processors. In the pictured 
embodiment there are eight The chip can be expanded (horizontally) to add mere processors. The chip 

so can as preferred, retain the logic and expand the DRAM memory with additional ceOs Bnsarly (vertically). 
Pictured are 16 - 32k by 9 tat sections of DRAM memory st rounding a field of CMOS gate array gates 
which implement & replications of a id bit wide dsna flow processors. 

Using IBM CMOS low power sub-micron IBM CMOS deposition on silicon technology, it uses selected 
silicon with trench to provide significant storage on a small chip euriaeo. Our memory and multiple 

65 processors organized interconnect b made with IBM's advanced art of making semiconductor chips. 
However* rt will be recognised that the little chip we describe has about 4 Meg. memory. R is designed so 
that as 16 Meg. memory technology becomes stable, when improved yields and methods of accommodate 
ing defects are certain, our litUe chip can migrate to larger memory sizes each 6 bits wide without changing 



15 



EP0570729A2 



the logic. Advances in photo and X-ray Smogrcphy keep pushing minimum feature size to well below 3 
microns. Our design envisions mora progress. These advances wffl permit ptaement of very large amounts 
of memory with processing on a s^e s»con chip. 

Our device is a 4 MEG CMOS ORAM beloved to be tta ^general memory oWp wRh extensive room 

s 1or logic 18 replications of a 32k by 9-bli DRAM macro make up Ihe memory array. The MAM has 12DK 
cala ft anocatss with sigrtffcant surface area for application k&c on the chip, with trip* level metal wring. 
The processor fogto eels are preferably gate array cete. The 35 ns or teas DRAM access time (ranches the 
pmcessor cycle time. This CMOS Implementation provides logic density tor a very effecflve PE (picket) end 
does so while dissipeting U wstfe tor the logic The separate memory section o? the chip, eac* 32K by 9 

?o bits, (wim expansion not changing logic) surrounds the field of CMOS gate array galas reprosecuting 120K 
eels, and having the logic described to other figures. Memory Is bordered and with a separated power 
source dissipates £ waits. In providing the corotrirung of signficant amounts or logic on the same silicon 
substrate wim significant amounts of memory problems involved with lha eJectrioal noise incompatibility ol 
logic and DRAM have been overcome. Logic tends to be very noisy «tile memory needs relative quiet to 

?5 sense the mil toft size signets that result from reading the cats of DRAM. We prefer to provide trenched 
triple metal layer silicon deposition, wKh separate barriered portions of the memory crap devoted to memory 
and to processor logic with voltage and ground isoiaSon, and separate power distribution and barters, to 
achieve oampasbRKy between logic and DRAM. 

2> APAP Sy&tem Overview of Prgawgd Embodiments 

litis description introduces the new technology in the following order 

1, Technology 

2, Chip WW description 

as 3, Networking and system build up 

4. Software 

5. Applications 

The initial sections of the detailed description describe how 4-Meg dram low power CMOS chips are made 
to include 8 processors on and as part of the manufactured PME DRAM chips each supporting: 
a> i. 16 bit 5 M IP dataflows, 

Z independent instruction stream and interrupt processing and 

3, 8 bit (plus parity and ccwolsj wide externa! port end interconnection to 3 oih*r on chip processors. 
Our invention provided multiple functions which are Integrated into a single chip design. The chip will 
provide PME functor which an) powerful and flexible *nd sufficiently so such that a chip having scalability 
» win be effective at processing, routing, storage and three classes of K>. This chip has integrated memory 
and corwi logic within the single chip to m*e the PME, and this comw nation is repeated within (he chip, 
A processor system is butt from replications of the single chip. 

The approach partitions the km power CMOS DRAM, it wis* be formed as multiple word length 09) hit 
by 32K secaons, assodefng one section with a processor. (We use the term PME to refer to a single 
<* processor, memory and I/O capable system unit.) This parSicning leak to each ORAM chip being an 8 
way 'cube connected* MfMO parallel processor with 8 byte wide independent irrfcercormecBon ports- (3ee 
FIGURE 8 for an illustration of a replcatton of fine-grained parallel technology, ftustratmg replication and the 
ring torus possmtittes.) 

The software description addresses several distinct program types. At me lowest level processes 
* tmerfece the user's program (or services cased by the application] to the detailed harefcw h/vy needs. 
This level Includes *e tasks required to manage the tO end interprocessor syndironteaio* and is what 
might be called a nacroprogram for die MPP. An intermediate level of services provide for bom mapping 
applications {developed won vector or mafrbc operations) to the MPP, and also control, synchtcnizatlon, 
startup, dtagnoetfc functions; At the host level, high order languages era supported by library functions that 
so support vectorized programs with either simple automatic data aiocatlon to (he MPP or user tuned data 
aflocattoa The murtMavei software s/W approach permits applications to exploit different degrees of control 
and optimization within a single program. Thus, a user can code application programs without undoretsndmg 
the architecture detafl whBe an optimizer mighi tune at the microcode level only the small high usage 
Kernels of a program. 

53 Sections of our description that describe 1024 element 6 QIPS unite and a 55,788 element 184 GiP8 
unit illustrate the range of possible systems. However, ttose are not the limits; both smafler and larger units 
are feaslbiev Tnese per&cufar sizes have been selected as examples because the small unit is suitable to 
microprocessors (accelerators), personal computers, workstation and miliary applteaSons (using of coarse 
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different packaging techniques}, while the larger unit Is iilustraive of & mainframe application as a module or 
complete supercomputer system. A software description win provide examples of other challenging work 
that might be effectively programmed on each of fie »u$tr*tive systems. 

s PME DRAM CMOS - A BASE FOR A MULTIPROCESSOR PME 

FK3URE 2 illustrates our technology improvement at ihe chip technology level. This extendable 
computet organization la very cost and performance efficient over the wide range of system sizes because 
It usee only one chip type. Combining m memory and processing on one chip eliminates the pins 

io dedicated to the memory bus and their associated reliab&ty and performance penalties. Repfcatbn of our 
design within me chip makes It economically feasible to consider custom logic designs for processor 
subsections- Replication of the chip within the system leads to large scale manufacturing economies. 
Fmairy, CMOS technology requires low power per Ml P. which in turn minimizes power supply and cooling 
needs. The cbfc architecture can be programmed for multiple word lengths enabling operations to be 

/5 performed that would otherwise require much larger length processors. In com tycoon these attributes 
permit the extensive range of system performance. 

Our new technology can be compered with a poasfote extension of me old technology it overlaps, it & 
apparent that the advantages of smaller features have been used by processor designers to construct more 
complex chips and by memory designers to provide greater replication of the simple element D the trend 

20 continues one could expect memories to get four times as large while processors might exploit density ta 

1. include multiple execute units with instruction routers, 

2. increase cache sizes and easoctetrve capability and/or 

3. Increase instruction look ahead and advance computation capability. 

However, these approaches to ihe oki technology illustrated by FIGURE t all tend to dead end. 

es Duplicating processors leads to Rneerly increasing pin requirements bus pins per chip is fixed. Better cache- 
ing can only sxploa the arjpacatJon*9 data reuse pattern. Beyond mat, memory bandmdft becomes the Gma. 
Application data dependencies and branching limit foe potential advantage of look ahead schemes. 
Additionally, it is not apparent that MM> appficabons with fine-grained parallelism need 1. 4 : or 16 
Msgsword memories per processing unit Attempting to share such large memories between muhrpte 

so processors results In severe memory bandwidth limitations. 

Our new approach is not dead ended. Wo combine bom significant memory and I/O and processor into 
a single chip, as illustrated by the FIGURE 2 and subsequent illustration and description. It reduces part 
number requirements and eliminates the delays associated with chip crossing, More importantly, this 
permits all the chip's VO pins to be defeated to biferproceseor oornmunteatJon and thus, maximizes 

as network bandwidth. 

To implement our preferred ernbodimem inustreted te figure 2 we use a process dial is sveBabte now, 
using IBM lev/ power CMOS technology. Om illustrated embodiment can be made with CMOS DRAM 
density , to CM08 and can bo Implemented in denser CMOS. Our illustrated embedment of 32K memory 
ce09 tor each of & PHEs on a cWp can be Increased as CM08 becomes denser. In our emtxxftment we 

40 utilize the real estate end process technology for a 4 MEG CMOS DRAM, end expend mss with processor 
replication associated wfch S2K memory on she chip Kseff, The chip, it wiH be seen has processor, memory, 
and MO in each of the chip packages of the cluster shown in FIGURE a Within each package la a memory 
wth embedded processor element, router, ana VQ. ail contained to a 4 MEG CMOS DRAM behaved to be 
the first general memory chip with extensive room for logic. It uses selected silicon with trench to provide 

4s significant storage on a small chip surface. Each processor chip of our design alternatively can be made 
wfth 16 replications of a 32K by 9 bit ORAM macro {$5/30 rts) using 47 micron CMOS logic to make up the 
memory array. The device is unique in that H allocates surface area for 120 K coos of application logic on 
the chip, supported by the capability of triple level metal wiring. The multiple cards of the old technology is 
shown crossed out on the left side of FIGURE 3, 

oa Our basic repdcable element brick technology Is an answer to the old technology. IT one considered the 
'%*r technology on the ten of FIGURE 3. one would see too many chips, too many cards, and waste. For 
example, Today's proposed teraftop machines Thar oihsrs offer would have literally a million or more chips In 
them. With todays other technology only a few percent of these chips, at best, are truly operations 
producers. The rest are "overhead 1 " frypicaiiy memory, network interface, etc.). 

58 It will become evident that it is not feasible to package such chips, in such a large number, in anything 
that must operate in a constrained eiwkronmerit of physical size. {How many could you ft* in a $mei area of 
a cockpit?} Furthermore, &uch proposed feraHop machines of others, already huge, must scste ia> 1000* 
times to reach the petaop range. We have a solution which dramatically decreases the percent of non- 
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operations producing cttps. W© provide increased bandwidth. We provide this within a reasonable network 
olrnerwkmoRy . VWtti such a brick technology, where memory becomes the operator, and networks are ueed 
for passing controls, where operation* producing chips era tometfcdliy increased* fo addition, the upgrade 
otemascaly reduces the number of different types of chips. Our system is designed for scate-up, without a 

s requremem (or specialized packaging, cooling, power* or environmental constraints. 

Vfcto our brick technology. uaOzing inroad of separate processors, memory unto wth bum in 
processors and network capability, ihe configuration shown In BGURG 3, representing a card, wim cNps 
which are fwi compatable with current 4Mb* DRAM cards at the connector level Such a single card could 
hoW, wfth a design point ot a basic 40 rnfo per chip performance level, 32 chips, or 1280 mips. Four such 

70 cards would provide 5 gips. The workstation configuration which k aiustrsted woupd preferably have such a 
PE memory array, a duster controller, and an IBM RISC SystemtfOOO which has sufficient performance to 
run and monitor execution of an array processor apptcabon developed at ma wortsfaiion- 

A very pale efficient processor can be ueed In ihe processor portion. Such designs for processors have 
been employed, but never within memory. Indeed* in addition, we have provided the ebfilty to mix MWD 

re and SIMD basic operation provisions. Our chip provides a 'broadcast bus* which provides an alternate path 
Into each CPUs detraction buffer. Our cfcster controller issues commands to each of the PEs in the PMEs, 
and mesa can be stored in the PME to control their opereiion in one mode or another. Each PME does not 
have to store sn entire program, but can store onfy those portions applicable to a given task at various 
times during proc es s ing of an applceilon. 

as Given the baste device one can elect to develop a single processor memory combination. AttemaaveJy. 
by using a more simple processor and a subset of ihe memory macros one can design for either fc <k 6 or 
16 relocations of ihe basic processing element (PME>. The PME can be made simpler either by adjusting 
the dataflow bandwidth or by subs«u*ng processor cycles for functional accelerators. For most embodi- 
ments we prefer to make 8 repfecaSons of ihe bask; processing element we describe, 

* Our application studies have indicated that for now the moot favorable answer Is 8 replications of a ie 
bft wide dafe flow and 32K word memory. We conclude this because: 

1 . 16 bit words permit single cyc& feich of instructions end addresses. 

2. 8 PMEs each wtth an external port permits 4 dimensions* torus taterconnections. Using 4 c* 6 PMEs 
on each ring leads to modules suitable for the range of targeted system performances, 
3D 3. 8 extemd ports requires about 50% of toe chip pine, providing sufflctenl remainder lor power, ground 
and common control signals. 

4, 3 Processors implemented in e 64 KByte Main Store 

a allows for a register based archtteclura rather man a memory mapped architecture, and it 
b. forces some desirable but not reojuired accelerators to be implemented by mutSple processor 
as cycles. 

This last attribute is important because « permits use of the developing logic density increase. Our new 
accelerators (ex fleeting point arithmetic unit per PME) ate added as crap hardware wHhout affecting 
system design, pins and cables or application code. 

The resultant chip layout and si*? x 14.33 mm) Is shown in FIGURE 2, and FIGURE 3 shows a 
to cluster of such cWpe. which can be packaged in systems ite those shown ir> later TOURE6 tor stand done 
units, workstations which slide next to a workstation host wfch a connection bus, in AWACs applications, and 
tn sivercorrtputere. This chip technology provides a number of system level advantages. It permits 
development of the scalable MPP by basic replication of a single part type. The two DRAM macros per 
processor provide sufOcient storage for both data and program. An SRAM of equivalent size might consume 

* more man io limes more po^er. This advantage permta MIYID machine models rather than the more 
Iroited SIMD models characteristic of rnachtnes with single chip processor/rnernory designs- The 35 ns or 
less DRAM access time matches Ihe expected processor cycte time. CMOS toglo provides the logic density 
for a very effective PME and dees so while dissipating only 1.3 watts. (Total chip power is 1.3 + £ 
(memory; * 22 w,) Those features In turn permit using the chip In MIL applications requiring conduction 

so cooflrtg. (Air cooing in nen-ML applications Is significantly easier.) However, the air cooled erntotimera can 
be used for wortstation and other environments, A standalone processor might be configured wnh an 80 
amp - 5 wit power supply. 

Advanced Parallel Array Processor <APAP) bulking bfocfes are shown In FIGURE 4 and In FIGURE 5. 
FIGURE 4 Hlustretes me ajneiondt block taagram of me Advanced Parallel Array Processor. Multiple 

55 application interfaces 150, 160, 170, 180 exist for the application processor 100 or processors I 10, 120, 
190. FIGURE 5 fflustrates the basic building bfocfes met can be configured into different system block 
diagrams. The APAP, in a maximum configuration, can incorporate 32768 idsntictl PMEs. The processor 
consists of the PME Array 280. 290. 3 J0> an Array Director 250 and an Application Processor Interface 



18 



EP 0570 729 A2 



3S0 for the application pnxwstor 200 or processors 210, 220, 230. The Array Director 250 consists at three 
furwttonai unite: Application Pnoeeescr Interface 200. duster eyrwhrcniier 270 and duster ConsroSor 270. 
An Ai ray Director can perform the functions of the array controller of our prior linear picket system tor SJMD 
operations with MIMD capability. The okistsr corwdier 270. along with a est of 64 Artsy dusters 200. 200. 

s 300, 310. tie. duster of 512 PMEs). a the basic building block of the APAP computer system. Tha 
stems* its of me Array D&ecsor 250 permit conjuring systems wrft a wide range of cluster replications. This 
modvtar by based upon strict repucafion of boflh processing and control elements Is unique to this massively 
pereDeJ computer system. In aotftlon, the Appficatton Processor interface 200 supports the Test/Debug 
device 240 which will accomplish important design, debug, and monitoring functions. 

10 Controllers are assembled wtlh a wad-defined interface, e.g. IE Ms Mfcrochennei, used in ofcer systems 
today* including comroaers wfch ieeo processors* Raid programmable gate arrays add functions to the 
controller which can be changed to meet a particular configuration's regalements (how many PMEa mere 
are, their couplings, etc) 

The PME arrays 280, 290, 300, 310 contain the functions needed to operate as etthei SlMD or MIMD 
;s devices. They also contain fun ca oris that permit the complete set of PMEs to be divided into 1 to 258 
distinct subsets. When divided Into subsets the Array Director 250 tnierleaves between subsets. The 
sequence of the interleave process and the amount of control exercised over each subset is program 
controlled. This capability to operate distinct subsets of the array In one mode, I.e., MIMD with differing 
programs, while oilier sets operate In tightly synchronized SIMD mode under Array Director control, 
3> represents an advance in the an. Several examples presented later muscat* the advantages of tiw concept. 

Array Architecture 

The set of nodes forming the Array is connected as a n-cSn^nsional modified hyper cube. In that 
2* interconnection scheme, each node has owect connections to 2n other nodes. Those connections can be 
either simplex, hart-duplex or fun-duplex type pains, in any dmsnsion greaser than 3d. the modified 
hypercube Is a new concept In imercwtneaJon techniques (The modified hyperoube in the 2d oase 
generates e torus, and in the 3d case an orthogonally connected letfce with edge surfaces wrapped to 
opposing surface,) 

30 To describe the intercomectton scheme tor greater than 3d cases requires an inductive description. A 
set of mi nodes can be interconnected as a ring, (The ring could be 'simply connected'. 'braided 1 , T cross 
connected*, 'fully connected*, efe Although additional node ports are needed for greater than simple rings, 
that added complexity does not affect the modified hypercube structure.) The ma rings can than be linked 
together by connecting each equivalent node In fee m* set of rings. The resuti at this point Is a torus. To 

38 construct a 1+ Id modified hypercube from an id modified hypercube, m»i sets of Id modified hypercubes 
and interconnect si of the equivalent m, level nodes into rings. 

This process is frustrated tor the 4d modified hypercube. using mi ■ 8 tor \ - i„4 by the illustration m 
FIGURE & Compare our description under node Topology and also FIGURES 6, 0, 10, 16 and 18, 

FIGURE 6 illustrates the fine-grained parallel technology path from the single processor element 300, 

<o mads up of 32K ie-bR words with a 16-bit processor to the Network nods 310 of eight processors 312 and 
their associated memory 31 1 with their fully distributed I/O routers 313 and Signal \iO ports 314, 315, on 
through groups of nodes labeled dusters 320 and Into the cluster configuration 360 and to the various 
application c 330, 340, 350, 370- The 2d level structure is the cluster 320, and 6* dusters are integrated to 
form the 4d modified hypercube of 32.768 Processing Elements 360. 

Processing Array Element (FIVE) Preferred Embodiment 

As frustrated by FIGURE 2 and FIGURE n the preferred APAP has a basic buUcfing block of e one 
chip node. Each node contains 8 identical processor memory elements (PMEs) end one broadcast and 

so control interface (BCl). White some of our inventions may be implemented when all functions are not on the 
same chip, it Is important from a performance and cost reduction standpoint to provide the crap as a one 
chip node with the 6 processor memory elements using the advanced technology which we have described 
and can be implemented today. 

Tha preferred implementation of a PME has a 64 KByte main store, 16 15-blt general registers on each 

35 of 0 program interrupt levels, a full function arithmetic/logic unit (ALU) with working registers, a statue 
register, and four programmable oKfirecftonsJ I/O ports, in addition the preferred implementaOon provides a 
SIMD mode broadcast interface via the broadcast and control Inferface (BCl) which allows an external 
controller (see our original parent application and the description of our currently preferred embodiment for 
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a nodal array and system wnh riustens) to drive PME operation decode, memory address, and ALU data 
Inputs, TOa chip can perform the functions o! a mtoiwomputer allowing multiple paraBel operations to be 
performed v^hai it and It can be coupted to other chips within a system of multiple nodes, whether by an 
interconnection network, a mesh or hypercube network, or our preferred and advanced scalable ernbodi- 
s meni 

The PMEs are interconnected In a settee of rings or tori m our preferred scalable embodiment In some 
applications the nodes could be interconnected In a mesh. In our preferred embodiment each node contains 
too PMEs in each of four tori. The tort are denoted WXY. and Z (see FIGURE 6) .FIGURE 1 1 deplete the 
btwrcwinectton of PMEs w&hin a node. The two PMEs In each tows are designated by tnetr external K> 
jo port i+VY, -W, +X. -X, + V, -V, +Z. -2). Within ihe node, there are also two rings which Interconnect the 4 
+nand4-nPMEs. These internal rings provWe the path tor messages to move between the external tori 

Since IN APAP can be fen our preferred embodiment a tour dimensional orthogonal array, the internal 
rings allow messages to move throughout the array in aU dimensions. 

The PMEs are sett-contained stored program microcomputers comprising a main store, locel store, 
decode, arfth meticlogic unit (ALU), working registers and InpuVOu^ut I/O ports. The PMEs have 
the capabWy of fetching and executing stored InsuucSons from meir own main stare in IVIMD operation or 
to fetch and execute commands via the Bd interface bi SIMD mode. This interface permits iniertormnunr- 
cetfon arnong the controller, the PME, end other PMEs In a system made up of rnurtpl© chips. 

The BCI Is the node's Interface to die external array controller element and to an array oVector, The 
to BCI provides common node functions such as timers and docks. The bo provides broadcast function 
masking for each nodal PME and provides me physical interface and buffering for the broadcast-bue-iD- 
pme data transfers, and also pwvtdes toe nodal toterface to system status and monitoring and debug 
elements. 

Each PME contains separate mlerrupt levels to support each of its poir^to-pokit interfaces and the 

29 broadcast Interlace. Date Is Input to the PME main store or output from PME main store under Direct 
Memory Access (DMA} control. An "Input transfer complete* interrupt ie available for each of the interfaces 
to signal the PME software thai data is present Slants information Is available for the software to determine 
the completion of data output operation^ 

Each PME has a "circuit switched mode" of I/O m which one of its four input ports can be switched 

3D directly to ones of Its four output ports, without having the data enter ihe PME main store, detection of the 
seuroe and destination of the "circuit switch" is under control of the softwaie executing on the PME. The 
other three input ports continue to have access to PME main store functions, white Ihe fourth input is 
switched to an output port 

An addiftonaJ type of to has data that must be broadcast to, or gathered from elf PMEs. plus data 

05 which Is too specialized to fH on me standard buses* Broadcast date can induce SIMD commands, MMD 
programs, and SIMD data Gathered date is primarily status end monitor (unctions. Diagnostic and test 
funcfcns are the speciatzed date element* Each node, m add-on to the mended set of PMEs, contains 
one SCI During operations the SGI section monitors the broadcast interface and steorefcoltects broadcast 
data to/fiom the addressed PMEts). A combination of enabling masks and addressing tags ere used by the 

« BCI to <tetermmd what broadcast Information te Intended for which PMEs. 

Each PME Is capable of operating in 3*40 or in M1MD mode in our preferred embodiment in SIMD 
mode, each instruction is fed into the PME from the broadcast bus via the BCI. The BCI buffers each 
broadcast date word unta al of Its selected nodal PMEs have used it TOs sync fa e r ttaa to n provides 
accommodation of the date iming dependencies associated with the execution oT SIMD commands and 

« aaows asynchronous operations to be performed by a PME. in mjmd mode, each PME executes its own 
program from Its own main store. The PMEs are Mfofaed te me SIMD mode- For MIMD operations, the 
external controller normally broadcasts Die program to each of 8ie PMEs white Ehey are in SIMD mode, and 
then commands the PMEs to switch to MM) mode and begin executing. Masking/legging the broadcast 
mtorrranron allows dnteront sets of PMEs to contain different MIMD programs, end/or selected sets of PMEs 

so to operate In MIMD mode while other sets of PMEs execute bi SIMD mode. In various software clusters or 
partitions these separate functions can operate independently of the actions in other dusters or partitions. 
The operaton of the Instruction Set Archrreciure (ISA) of the PME will vary slightly depemSng on whetfier 
the PME Is In the SIMD or MIMD mode. Most ISA instructions operate IdenfccaJly regardless of mode. 
However, since the PME in SIMD mode does not perform branching or other control functions some code 

53 points dedicated io (hose MIMD instructions are reinterpreted in SlMO mode to alow the PME to perform 
special operations such as searching main rrtemory for a match to a broadcast data value or swt&hino to 
MWD mode- This further extends system SexWIity of an array. 
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PME Architecture 

Basically, our preferred aichtechrre comprises a PME which has a 16 Wt wide date flow, 32K of 16 bit 
memory, specialized I/O parte and I/O switching paths, plus the necessary control logic to permS ozch PME 

5 to fetch, decode end axecuie the 18 bit Instruction set provided by our instruction set architecture (ISA). 
The preterm) PME performs the funcSone of a virtual roofer, and thus performs bom the processing 
functions and data router functions. The memory organization allows by cross addressing of memory 
between PMEs access to a large random access memory, and direct memory toi the PME. The individual 
PME memory can be el local, or divided into local and shared global areas pragrammaflcaHy. Specialized 

10 con&ds and capabilities which we describe permit rapid task switching and retention of program stale 
information at each of the PMEs interrupt execution levels, Although some of the capabilities we provide 
have existed In other processors, their appftcaaon for management of intenrocessor UO (a unique m 
massively parallel machines. An example Is the integrate or the message router function into the PME itself. 
This elmJnates cpedallaed router chips or development of spedalzed VLSI routers. We also recognize that 

;s in some Instances one could distribute the functions we provide on a single chip onto several chips 
Inter conne cted by mstafizafJon layers or otherwise and accomplish improvements to massively parallel 
machines- Further, as our architecture b scalable from a single node to massively parallel supercomputer 
rev«i machines, H is possible to utiBze some of our concepts at different levels. As we will illustrate for 
example our PME data flow Is very powerful and yel operates to make the scalable design effective. 

so The PME processing memory element develops for each of the mu@pJe PMEs of a node, a fuBy 
distributed architecture. Every PME will be comprised of processing capabWy with 16 bit date now, 64K 
bytes of local storage, store and forward/circuit switch logic. PME to PME communication, SlMD/MlMD 
switching capabilises, programmable routing, and dedicated Hoaang point assist logic. These functions can 
be independently operated by the PME and integrated with other PMEs within (he seme chip to minimize 

ss chip crossing delays- Referring to FIGURES 7 and 8 we IHj strata *e PME dataflow. The PME consists of 
16 Off wide dataflow 425. 435. 445, 455. 465. 32K by 16 brt memory 420. specialized VO ports 400. 410. 
480, 490 and I/O switching paths 425, plus the necessary control logic to permit the PME to fetch, decode 
and execute a 18 bit reduced instruction set 430. 440, 450. 480. The special logic also permits the PME to 
perform as bosh the processing una 4S0 and data router. Specialized controls 405, 406, 407. 408 and 

30 capabilities are Incorporated to permit rapid task switching and retention of program state information at 
each of the PMEs" interrupt execution levels. Such capabilities have been included in other processors; 
however, their application specrfteaity for management of Interpreoessor I/O Is unique in massively parallel 
machines. GpecKicaify, U permits the integration of the router function Into ths PME without requiring 
specialized chips o» VLSI development macros, 

ss 

16 bit internal data flow and control 

Ths major parts of the internal data flow of the processing element are shown in FIGURE 7. FIGURE 7 
illustrates the Internal data Sow of the processing element This processing element has a full 8 brt internal 

40 data How 425. 435, 445. 455, 465, The Important paths of the internet data flows use 12 nanosecond hard 
registers such as die OP register 450, M register 440, WR register 470, and the program counter PC 
register 430. These registers toed the fuBy distributed ALU 480 and I/O router registers and logic 405, 408. 
407, 400 for all operations. With current VLSI technology, the processor can execute memory operations 
and instruction steps at 25 Mhz> and it can build the important elements, OP register 450« M register 440, 

*a WR register 470> and the PC register 430 wish 12 nanosecond hard registers. Ctther required registers are 
mapped to memory locations. 

As seen in FIGURE 8 Die internal data Bow of the PME has to 32K by 18 brt main store Hi the form of 
two DRAM macros. The remainder of the data flow consists of CMOS gate array macros. All of fte memory 
can be formed with the logic wfch low power CMOS ORAM deposition techniques to form an very large 

so scaled Integrated PME chip node. The PME Is replicated 8 times In the preferred embodiment of the node 
chip. The PME data flow consists of a 16 word by 16 bit general register stack, a multifunction 
arfthmettertogic unit (ALU) working registers to buffer memory addresses, memory output registers, ALU 
output registers, cperattontommand* I/O output registers, and multiplexors to select inputs to the ALU and 
registers- Current CMOS VLSI technology for 4MByte DRAM memory with our logic permits a PME to 

ss execute instruction steps at 25Mh*. We are providing the OP register, (he M register, the WR register and 
the general register stack with 12 nanosecond hard registers, Other required registers are mapped to 
memory locations wiftin a PME, 
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The PME data flow is designed as e 18 bit integer arithmetic processor. S|pectai multiplexor pate have 
boon added to opdrrrin subroutine emulation of n x 1a bU floating point operation* (n=>1). The 16 bit data 
flow perrons effect emulation of floa*H) po^m operation*. Specific paths within fre data flow have been 
included to permit floating point oporafions in as little as 1 0 cycles. The ISA toefcdes special code point to 

s permit srtrouttnes for extended (longer Iran 16-bit) operand operations. Tte subsequent floating point 
performance Is approximately one twenaeth the axed flooring point performance. This performance is 
adequate to eKninate the need for special floating point chips augmenting the PME as Is characteristic ot 
oft* massively peralei machines. Some other processors <to include the special floating point processor* 
on the same chip ae a single processor (See FIGURE 1). We can enable special floating point hardware 

jo processors on the seme chip with our PMEs but we would now need additional eeHs than is required for the 
preferred embodiment For floating point operation^ see afeo the concurrentry filed FLOATING POINT 
application referenced above for improvements to the IEEE standard. 

The approach developed is wal poised to take advantage of the normal Increases in VLSI technology 
performance. As circuit stze shrinks and greater packaging density becomes possible then data flow 

75 elements Bee base and indsx reverters, currency mapped to memory could be moved to hardware, 
likewise, floating point sub-steps are accelerated with additional hardware which we wtl prefer for the 
developing CMOS DRAM technology as reiaWe hinher density levels are achieved. Very importantly, Ms 
hardware alternative does not affect any software. 

The PME Is Initialized to SJMD mode with interrupts disabled. Commands are ted Into the PME 

3d opetaaon decode buffet from fie Bd. Each *me an instructiori opertfon completes, the PME requests e 
new command from the BCt Si a simlar manner, knmediara data is requested from the BCI et the 
approprlsle point In the instruction execution cycle. Most instructions of the ISA operate identically whether 
the PME is in SIMD mode or In MIMD mode, wHh \h* exception of Ihat SIMD instructor* and immediate 
daia are taken from the BCI; m MIMD mode the PME mainlams a program counter {PC} and uses lhat as 

9 the address within own memory to fetch a 18 bit instruction, instructions such as •Branch- which 
sxpfieffiy address the program courier have no meaning in SIMD mode and some of toose code points are 
reinterpreted to perform special SIMD functions as comparing immediate data against an area of main store 
The PME instruction decode logic permits either SlMDiMMD operational modes, and PMEs can 
transition between mooes dynemictfry. to smu mode the PME receives decoded intiruction information 

so and executes lhat data In the next dock cyde. In MMD mode the PME maintains a program counter PC 
address and uses that as the address within its own memory to fetch a 16 bit Instruction. Instruction decode 
and execution proceeds as in most any other RISC type machine. A PME in 3WD mode enter* MIMD 
mode when given the InTormaiion that represents a decode branch. A PME In MMD mode enters the SIMD 
mode upon executing a specific instruction for the transition. 

as When PMEs transition dynamically between SIMD and MIMO modes, an MiMO mode is entered by 
execution erf a SIMD w wt«e conW register instruction with the appropriate control Wt set to a "1 •.Attbe 
completion of the SIMD »isiructk>n, m PME enters the MWD mode, enables interrupts, and begins 
fetching and executing its MIMD instructions from the main store location specified by Ha general register 
RO. Wemipts are masked or unmasked oepencRng on the state of interrupt masks when the MM) control 

40 bit Is set. The PME returns to SIMD mode either by being externally rolnitiateed or by executing a MIMD 
"write control register" instruction wift the appropriate control bk set to aero. 

Data communication paw and control 

« Returning to Figure 7 it will be seen that each PME has 3 input pons 400, and 3 output pons 480 
intended for on-chip comrrwnicarJon plus 1 I/O port 410, 440 for off chip communications- Existing 
technology, rather than lie processor idea, require* lhat the off-chip port be byte wide half duplex. Input 
ports are connected such that dste may be routed from input to memory, or from input AR register 406 to 
output register 408 via direct 16 bft data path 4£&\ Memory would be the date sink wr messages targeted at 

so the PME or for messages that were moved in 'store and forward" mode. Messages foal do not target the 
particular PME are sent directly to toe required output port, providing a 'circuit switched" mode, when 
Hocking has not occurred The PME SW is charged with performing toe routing and determining the 
selected transmission mode. TO* makes dynamically selecting between 'circuited switched 7 and 'store and 
forward' modes possible. This is also another unique characteristic of the PME design. 

as Thus, our preferred node has fi PMEs and each PME has 4- output ports {Left, Right, Vertical, and 
External Three of the input ports end three of the output ports are 1 6-bit wide ton duplex point-to-point 
corm ec Bo m to the other PMEs on the chip. The fourth ports are combined m the preferred ewtfxritoent to 
provide a heJf duplex poinHo-pcint connection to an off-chip PME. Due to pin and power constraints fiat we 



22 



EP 0 570 729 A2 



have Imposed to make use of the less dense CMOS we employ, the actual of toup interface is a byte-wide 
path which to used to multiplex two halves of the inter-PME data word. With epedd "Upper dmuto-y which 
provides a dynamic, temporary logical breaking of msemiocW rings to allow data to enter or leave an array, 
these external PME ports provide the APAP external I/O array function. 

* For data routed to the PME memory, normal DMA Is supported such that the PME Instruction stream 
must become involved in the VO processing omy at fie beginning and end of mesaagsa. Finally, data that Is 
being 'circuit switched to an Internal output port Is forwarded without docking. This permte single cycle 
data transfers within a chip and detects when chip crossings will occur such that the fastest but still reflabJe 
communication can occur* Past forwarding udfizes forward data paths and backward control paths, aJI 

10 operating in transparent mode. In essence, a source looks through several stages to see the acknowledg- 
ments from toe PME performing a DMA or ofKfttp transfer. 

As seen by FIGURES 7 and 8 Data on a PME input port may be destined tor the local PME, or for a 
PME further down the ring. Data destined tor a PME further down the ring may be stored in the local PME 
main memory and then forwarded by the local PME towards the target PME {store and forward)* or me local 

jq input port may be logically connected to a particular local output port (ctrcurc switched) such that toe data 
passes Transparently'' through fhe local PME on its way to the target PME. Local PME software 
dynamlcaly controls whether or not me local PME is to "store and forward 7 mode or in "circuit swftchetP 
mode tor any of the tour inputs and four outputs to circuit swtthed mode, the PME concurrently processes 
all functions except the I/O associated wish ate circuit switch: In store and forward mode the PME suspends 

*o fill other proce«ing functions to begin the 10 tor«arcfing process. 

While data may be stored ex ternary of the array in a shared memory or DASO (with external controller), 
it may be stored anywhere in the memories provided by PMEs. Input data destined tor me local PME or 
buffered to toe local PME during "store and forward" operations is placed into local PME mato memory via 
a direct memory access (address) mechanism associated with each of the input ports, A program interrupt 

£S Is available to tocflcate fret a message has been loaded into PME main memory. The local PME program 
interprets header data to determine it the data destined for toe local PME a control message which can 
be used 10 set up a drcutvswttched path to another PME, or whether II is a message to be forwarded to 
another PME. Circuit switched paths are controfled by local PME software. A circuit switched path logically 
couples a PME Input path directly to an output peth without passing through any intervening buffer storage. 

so Since lira output paths between PMEs on the same chip have no tolerating butter storage, data can enter 
the chip, pass through a number of PMEs on the chip and be loaded info a target PMPs main memory in a 
single clock eyelet Only when a circuit switch comb nation feavea the chip, » an intermediate buffer storage 
required This reduces the effective diameter of the APAP array by a number of unbuffered circuit switched 
paths. As a result date can be sent from a PME to a target PME to as tow clock cycles as there are 

38 (nterventog chips, regardless of the number of PMEs in the pain. This kmd of routing can be compered to a 
switched em-honmer* to which at each node cycles are required to carry data on to toe next node. Each of 
our nodes has 8 PMEs! 

Memory and Interrupt Levels 

The PME contains 32K by 1 8 bit 420 dedicated storage words. This storage m completely general and 
can contain both data and program. In 81 MD operations all of memory could be data as Is characteristic of 
other SIMD massively parallel rnachtoes. to MIMD modes, the memory is quite normal; but unlike most 
massively parallel MIMD machines toe memory is on toe same chip wrto the PME and is thus, immediately 

46 aveuabta This men eliminates the need for cache-tog and cache coherency techniques characteristic of 
other massively parallel MIMD machines, to trie case for Instance of tomos's chip, ordy 4K resides on the 
chip, and externa] memory interface bus and pins are required These are eflmtoated by us. 

Low order storage locations are used to provide a set of general purpose registers tor each interrupt 
level. The particular tSA developed tor the PME usee shore address fields for these register references. 

so Interrupts are utilized to manage processing. UO activities and SArV specified functions (i.e., a PME in 
normal processing wiD switch to an Interrupt level when inoomino i>0 initiates), if the level is not masked, 
the switch te made by changing a poirnsr to H/W such that registers are accessed from a new section of 
low order memory and by swapping a stogie PC value. This technique permits last level switching and 
permits SW to avoid toe normal register save operations and also to save status within the Interrupt level 

98 registers. 

The PME processor operates on one of eight program Interrupt levels. Memory addressing permits s 
parftiordng of toe lower 576 words of memory amoung toe eight levels of interrupts, B4 of these 676 words 
of memory are directly addressable by programs executing on any of toe eight is vets. The other 512 words 
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are partitioned Into eight 64 word segments. Each 64 word segment is directly accessible only by programs 
executing on ha associated tatemipt level. Indirect addressing techniques cue employed for atoning: all 
programs to access all 32K words of PME memory. 

The interrupt ravels era assigned to the input ports, the BO, and to error handling. There Is a *normaT 

s level but there Is no "privileged "> nor "supervisor level A program interrupt causes a context switch In 
whten the comers* of the PC piogram counter, status^orrtrot register, end selected general registers are 
stored to specified main memory rocarfons and new values for these registers ore fetched torn other 
specified main memory locations. 

The PME data flow dtecussed with reference to FIGURES 7 and 8. may be am pilled by retarence to 

jo the additional sections below. In a complex system, the PME data flow uses the combination of tha chip as 
an array node with memory* processor and lO which sends and receives messages wSh the BOl that we 
repflcate as the baste butefing block of an MMP bum wi§> our APAP, The MMP can handle many wo?d 
lengths* 

>5 PME Mugpje Length Date Flow Processing 

The system we describe can perform the operations handled by current processors w9j t>e dataflow in 
the PME which Is 16 bits wide. TO* is dons by performing operations on data lengths which are multiples 
of 16 bifcu This Is accomplished by doing tfie operation In 18 bit pieces. One may need to know the result 
a? of each piece $ «. was it zero, was there s carry out of the high bits of me sum). 

Adding too numbers of 48 bits can be an example of dasa flow. In this example two numbers of 46 bits 
{ afCM7> and WEM7) ) are added by performing the following in the hardware: 

a(32-47) + b(32-47)->ene(32-47) step one 



1) save the carry out of high bit of sum 

2) lemember If partial result was zero or non-zero 

a> a(l601)-!-b{1 6-31)+ save (2irry->anB{16<*1) step two 



1) save the carry out of high bit of sum 

2) remember If partial result was zero or non-zero from this result and from previous step; if both are 
a zero remember zero; if either is non-zero remember non-zero 

a(0-i5) * b((M5) + saved carry->ais(0-1 5) final step 



40 (Mitts piece is zero and last piece was zero ana is zero 

2) if (his piece is zero and last piece was non-zero ana is norreero 

3) i this piece Is non-zero ens is positive or negative based on sign of sum (assuming no overflow) 

4) if cany into sign of ans os not-equal to carry out of sign of answer, ans has wrong sign and result Is 
an overflow (can not properly represent in the avaflabJe bits) 

45 The length can be extended by repeating the second step to the xtiom as many times as required, if the 
length were 32 the second step would not be performed. Ithe tengft were greater than 48, step two would 
be done multiple limes. If ins length were just 16 the operation in step one. with conditions 3 and 4 of the 
final step would be done. Extending the length of the operands to multiple lengfts of fre data Sow k$ a 
technique having a consequence that the instruction usually takes longer to execute for a narrower data 

so flow. That Is. a 32 bit add on a 32 bit data How only takes one pass through the adder logic, while the same 
add on a 16 bit data now takes two passes through the adder logic. 

What we have done fiat Is Interesting is that In the current Implementation of me machine we have 
single Instructions which can perform adds/subtrextyoompiuesmicves on operands of length 1 to 6 words 
(tonga is defined as pert the msmjciton). individual instructions avaiable to the prc^rammer perform the 

S3 same kind of operations as shown above for steps one, two, and final (except to the programmer the 
operand length is longer Le. 16 to 126 bite). At the bare bones hardware level, we are woridng on 16 bit? at 
a time, but the programmer thinks sme'e doing 16 to 128 bits at a time. 
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By using combinations of these instructions, operands of any length can be manipulated by ths 
program Le. two instructions can ba ueed to add wo numbers of up to 263 bits in length. 

PMC Processor 

Our PME processor id different from modern mtoopttcesaor? currently uSazed for MPP appiicetkm. 
Ths processor portion dWIopsnces include: 

1. the pnxessor la a fuBy progrernrnabie hardwired computer (see the ISA description for an instruction 
set overview} wttte 

io o it has a complete memory module shown In the upper right comer (see FIGURE Q\ 

o it nas hardware regt?*r$ with controls required to emulate separate raster eats for each interrupt 

level <3to«n ft upper left comer), 
o its ALU has the required registers and controls to permit effective muffi-cycle integer and floating 

point artfwnefct 

;s o a hsa I/O swnchfng paths needed io support pecker! or circuit switched data movement between 

PMEs tntefoon necled by point-to-point Onto shown In 0)9 lower right comer* 

2. This is our minimalist approach to processor design permitting eight replications of the PME pet chip 
with the CMOS DRAM technology. 

3. This processor portion of the PME provides about the minimum dataflow width required to encode a 
iso fast Instruction Set Architecture {©A) -see TeWes - which is required to permit effecfre niiwd or SIMD 

operation of our MMP, 

PME Resident Software 

a The PME Is the smallest element of the APAP capable of executing a stored program 1$ can execute a 
program which is resident in some external control element and fed to it by the broadcast and control 
interface «BCI) in SMD mode or ft can execute a program which is resident In Its own main memory <MIMD 
model 6 can dynamically switch fcertween SIMD mode end MIWD mode representing SIMD/MIMD mode 
duality functions, and ateo fte system can execute those auaifjee at the same time (SlMiMD mode), A 

30 particular PME can make this dynamic switch by merely setting or resetting a oil In a control register. Since 
SIMD PME software is actually resident in the external control element, further cfecusston of this may be 
found fri our discussion of the Array Director and in related applications, 

NHMD software la stored Into the PME main memory while the PME is in SIMD mode. This Is Feasible 
since many of fte PMEe wfll contain identical programs because they will be processing similar data in an 

38 asynchronous manner. Here we would note that these programs are not fixed, but they can be modified by 
loading the WWD program from an external source during processing of other operations. 

Since the PME instruction set architecture represented in the Tables is that of a inicrooomputer, tore 
are few restrictions with this architecture on the functions which the PME can execute, Essentiaiy each 
PME can function Ike a RISC microprocessor. Typical mjmd PME software routines are Mated below: 

40 1 . Basic control programs for dispatching and prioritizing the various resident routines. 

2. Communication software to pass data and control messages among the PMEe. This software would 
determine when a particular pme would go Wo/out of the "circuit switched" mode, it performs a "store 
and forward" function as appropriate, it also In&stea, sends, receives, end terminates messages between 
its cwn main memory and that of another PME. 

3. Interrupt handling software completes the context switch, and responds to an event which has caused 
Bin iniBrrupt These can induce teB-ssfe routines and rerouting or reassignment of PMEs to an array, 

A, Routines which implement the extended ta&Snuctton 6et Architecture which we describe below. These 
routings perform macro level instructions such as extended precision fixed point arithmetic, floating point 
arttmefic. vector arthmetic, and the tike. This permits not only complex math to be handled but image 
30 processing activities for display of image data in multiple dimensions {2d and 3d images) and multimedia 
processes. 

5. Standard mathematical Sbrary functions can be included. These can preferably inducts UNPAK and 
VPSS routines. Since each PME may bo operating on a different element of a vector or matrix, the 
various PMEs may all be executing different routines or dif ferine, portions of the seme matrix at one «1me, 
58 6. Specialised routines lor performing scatler.^ather or sorting functions which take advantage of the 
APAP nodal toterconoeotion structure and permit dynamic moW-dimenslonal routing are provided. The 
routines effectively tekt advantage or some amount of syncbronkation provided among m various 
PMEs, while permitting asynchronous operations to continue, Tor sorts, there are sort routines. The 
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APAP is wcfl sueed to a Bateher Sort Because that sari requires extensive calculations to determine 
particular element to compare versus vary short comparison cycle** Program eyiwhrcriixeJon is mart- 
aged by toe WO stotemeot*. The program asows nnuWpfe date elements per PME and very larg^paresei 
sorts in quit? a stiaio^Jt forward manner* 
s White each PME has Its own resident sofhvare, the systems made from these microcomputers can 
axecuta higher level language processes designed tor scalar and parafteJ machines. Thus the system can 
execute application programs vdtfch have been written tor UNfX machines, or those of other operating 
systems, in high level languages such as Fortran. C< 0 + + . ForfrenD, and so on. 

ft may be an Interesting footnote that our processor concepts use an approach to processor design 
10 which is quite old. Perhaps thirty years of use of a similar ISA design has occurred in IBM's military 
processors* We have been toe *st to reoogntee that this kind of design can be used to advantage to 
leapfrog the dead ended current modem microprocessor design when combined with our tote) PME deafen 
to move fte technology to a new path tor use in the next century. 

Although toe processors design characterises are quite different from other modern microprocessors, 
jc similar gate constrained msBary and aerospace processors have used the design since the *80s. It provides 
suf Sclent instructions and registers for straight forward compiler development and both general and signal 
processing applications are effectively running on this design. Our design has mWrnal gate requirements, 
and e3M has Implemented some similar concepts for years when embedded chip designs were needed 
general purpose processing. Our adoption no* of parts of fie older ISA design permits use of many uttBUes 
aa and other software vehicles which wil enable adoption of our systems at a repkf rate because of the 
existing base and the knowledge tost mssny programmers have of the design concepts. 

PME VP 

3 The PME will interface to toe broadcast and control Interface (BO) bus by either reading data from toe 
bus Into toe ALU via the path labeled BCI in FIGURE e or by fetching tosmjcnons from the bus directly into 
the decode logic (not shown). The PME powers up to SttID mode and in that mode reads, decodes and 
executes Instructions until encountering a Diarchy A broadcast command 3tMD mode causes toe fransrtton 
to MKMD with inatrucsone fetched locally. A broadcast PME instruction wernal DtOW reverts toe state. 

a> PME I/O can be sending data, passing data or receiving data. When sending data, the PME sets the 
CTL register to connect XMT to either U R* V, or X, H/W services toon pass a block of data from memory 
to the target via the ALU multiplexer and the XMfT register- This processing interleaves with norma) 
instruction operation. Depending upon application requirements, the block of data transmitted can contain 
raw data for a predefined PME and/or commands to establish paths, A PME that receives data win store 

as toput to memory and interrupt toe active lower level processing. The interpretation task at the interrupt level 
can use the imerrupi evem to do task synchronization or Initiate a transparent I/O operation r>lieo data is 
addressed elsewhere^ During toe transparent I/O operation, toe PME is tree to continue execution. Its CTL 
register makes H a bridge. Data will pass through it without gating, and it will remain in that mode until an 
Instruction or the data stream resets CTL While a PNC is passing data It cannot be a date source, but ft 

4? can be a d&a sink for another meseage. 

PME Broadcast Section 

This is a chip-to-common control device interface. It can be used by the device that serves as a 
* controller to ocmrnand I/O, or tost and (flagnose the complete crop. 

Input is word sequences feftier instruction or data) that are available to subsets of PMEs- Associated 
wtm each word la a code Indicating which PMEe wfl use the word. The (he BCI win use the word both to 
limit a PUPs access to toe irrterfaco and to assure that all required PMEa receive data. This serves to 
adjust the BCI to the asynchronous PME operations. (Even when in SIMD mode PMEs are asynchronous 
so due to I/O and tolerrupt processing^ The mechanism permits PMEs to be formed into groups which are 
controlled by interleaved sets of commano>data words received over the BCI. 

Besides delivering data to the PMEs, toe 601 accepts request codes from toe PME combines them, 
and transmits Hie Integrated request This mechanism can be used in several ways. MIMD processes can 
be tohlatod In a group of processors that ail end with an output signal. The 'ANT/ of **gneja triggers the 
ss controller to initiate a new process. Appfications, in many cases, will not be able to load all required 3/W in 
PME memory. Encoded request to toe controller wil be used to acouire a S/w overlay from perhaps toe 
host's storage system* 
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The controller uses a serial scan loop through many chips to acquire information on individual chips or 
PMEs. 71*680 loops Initially interface to the BCl but can in the BCl be bridged to Individual PMEs. 

Broadcast and Control Interface 

The 601 broadcast and control interlace provided on each chip provides a paraSei Input interface such 
that data or instructions can be sent to the node. Incoming data is tagged with a subset identifier and the 
BCl includes the functions required to assure that all pmes within the node, operating *lthin the subset, are 
provided the data or IneirucDone, The parallel Interlace of the SCI serves both as a port to permit data to be 
w broadcast to all PMEs end as the instruction interlace during SIMD operations. Satisfying both requirements 
plus extending those requirements to supporting subset operations is completely unique to this design 
approach. 

Our BCJ paraSel Input interface permiis data or irotruttiona to be sent from a control element that Ea 
external to the node. The BCl contains "group assignment* registers (see the grouping concept In our 

;s above applscafion erratled GROUPING OF SIMD PICKETS) which are associated with each of the PMEs. 
incoming data words are tagged with a group identifier and the BCl Includes the functions required to 
assure that all PMEs wimio the node which are assigned to the dedicated group are provided the data or 
Instructions. The paratiel interface of the BCl serves ae both a port to permit data to be broadcast to the 
PMEs during M1MD operations, and as the PME tmaijcuon/bimeolate operand Interface during SIMD 

so operation*. 

The 601 also provides two serial interfaces. The high speed serial pari vrifl provide each PME wim the 
capabfflty to output a limited amount of status information. That data is intended to: 

1 . signal our Array Director 61 0 when me PME, eg. 500, has data mat needs to be read or that the PME 
has completed some operation, tt peases a message to the external control element represented by the 

20 Array Director, 

2. provide actrvrty statu* such mat externa! test and monitor elements can Hhistrate the status of the 
entire system. 

The standard serial port permits the external control element means for selectively accessing a specific 
PME tor monitor and control purposes. Data passed over this interface can direct data from the 0a parallel 

so interface to a particular PME register or can select data from a particular PME register and route it to the 
high speed serial port. These control points allow the external control element to monitor end control 
individual PMEs during initial power up and diagnostic phases. It permits Array Director to Input control data 
so as to direct U*e port to particular PME and node Internal registers and access points. These registers 
provide paths such that PME of a node can output data tome Array Director, and these registers permit the 

38 Array Director to input data to the units during infilal power up and diagnostic phases. Data Input to access 
point can be used to control test and diagnostic operations* le- perform single Instruction step, stop on 
compare, break points, etc. 

Node Topology 

Our modified hypercube topology conrwcfion is most useful for massively parallel systems, while other 
less powerful connections can be used with our basic PMEs* Within our InfflaJ embodiment of the VLSI chip 
are eight PMEs with toay distributed PME Internal hardware connections- The internal PME to PME chip 
corporation is a two rings of four PMEs, with each PME also having one connection to a PME in the other 

« ring. For the ease of eight PMEs in e VLSI cwp this to a three dimensional binary hypercube, however our 
approach in general does not use hypercube organizations wKhIn the chip. Each PME also provides tor the 
escape of one bus. In the Initial embodiment the escaped buses form one ring are cafied + X +Y, + W and 
+Z> while those from the other ring are labeled similarly except - (minus). 

The specific chip organization is referred to as the node of *s array, and a node can be in a cluster of 

so the array. The nodes are connected using +-X and +-Y Into an array, to create a duster. The 
dimensionality of toe array is arbitrary, and In general greater than two which is the condition required for 
developing a binary hypercube. The clusters are then connected using *-W, +-Z into a array of clusters. 
Again, the dimensionality of the array Is arbitrary. The result is lie 4-dmensionai hypercube of nodes* The 
extension to a S-dimensional hypercube requires the usage of a i0 PME node and usee the additional two 

58 buses, say + -El to connect the 4-dimenslonal hypercube into a vector of hypercubes. We have iien shown 
the pattern of extension to either odd or even radix hypercubes. This mooted topology limits the cluster to 
duster wiring while supporting me advantages of me hypercube cormecttoa 
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Our wireabi&iy and topology configuration for massively parallel machines has «dventegas m keeping 
the X and Y dimensions wtiMn our duster level of packaging, and in dl«rtbuJng ihe w and z bus 
connections to el the neighboring dusters. After implementing fte techniques described* the product wia be 
wtroabte. and manufecturabte while rnatitaining the inherent characteristics of the topology defined, 

5 The node consists of k ' n PMEs plus the Broadcast and Control Interface (BCl) section. Here *n" 
represents the number of dimensions or rings, which characterize the modified hypercube, while *V 
represents ihe number of rings that characterize the node. Although a node can contain k rtngs h is a 
characteristic of the concept that only t*o of *ose Hngs may provide escape buses. V and •k" Is limited 
to our preferred embodiment, because of the physical chip package to N-4 and k-2. This timtetion te a 

3D physical ones and different chips sals *t£ allow other and increased dimensionality in the array, in addition 
to being a part of the physical chip package* IS is our preferred embodiment to provide a grouping of pmes 
that interconnect a set of rings in a modified hypercube. Each node will have 8 PMEs with their PME 
arcnilecfure and abflfty to perform processing and data router functions. As such, n is the oSmenstonafty of 
the rnodWed hypercube (see fbflowing section), Le* a 4d modified hypercube's node element would be 6 

>5 PMEs while a 5d modified hypercube's node would be 10 PMEs. For vlsuaization of nodes which we can 
employ, refer to FIGURE 6» as well as FIGURES 9 and 10 for vtsuaszalion of interconnections and see 
FIGURE 11 for a block (fiagiem of each node. FIGURES 16 and 10 elaborate on possible interconnections 
for on APAP. 

ft wM be noted that the application entitled "lffiT>IOD FOR INTERCONNECTING AND SYSTEM OF 
HO INTERCONNECTED PROCESSING ELEMENTS 1 * Of 00-lnventor David B. Rone, fifed In the United St**S 
Patent and Trademark office on May 13. 1991, under U3SN 0*698,886, descr&ed the modified hypercube 
criteria which can preferably be used in connection with our APAP MNTP, 

That application is incorporated by reference and describes a method of interconnecting processing 
elements m such a way that the number of connections per element can be balanced against ihe network 
25 dtemeter (worst case path length}, ibis Is done by creating a topology that maintains many of the well 
known and desirable fopoJogfcaJ properSea of hypercubes white improving He flexfcflfty by enumerafing the 
nodes of the network m number systems whose base can be varied. When using a base 2 number system 
this method creates the hypercube topology. The invention has fewer Interconnections than a hypercube. 
uniform connections and preserves the properties of a hypercube. These properties include: i) large 
30 number of alternate paths,. 2) very high aggregate bandwidth, and 3) weU understood and existing methods 
that can be used to map otter common problem topologies with the topology of the network. The rest* is a 
generalized non-binary hypercube wfth less density. It wM be understood that with the preference we have 
given to the modified hypercube approach, in some applications a conventional hypercube can be utilized, 
hi connecting nodes, otter approaches to a topology could be used; however, the ones we describe herein 
as are beloved to be new and an advance, and we prefer the ones we describe. 

The interconnection methods for the modffled hypercube topology for interconnecting a plurality of 
nodes in a network or PMEs: 

1, defines a sets of integers ei, e2, e3. „. suoh the product of elements actuals (he number of PMEs 
en the network called Wl while the product off al elements In the set excepting el and e2 is the number 
40 ot nodes celad N, and the numbsr of elements in the set caned m defines me dimensionality of the 
network n by the relationship n = m-2. 

z addresses a pme located by a set of Indexes al. a2» ... am, where each Index Is the PMEs position in 
the equivalent level of expansion such that the Index al Is In the range of zero to ei-1 tor i equal to 1. 2, 
to nr. by the formula 

Ua(m)*e«iH) + a lm-2))s<rrH) a(2re<1)>*e(1) 

where the notion e® has the normal meaning of the me ith element m the nst of elements caned a, ot 
equhr aJently for e. 

50 3, connects two PMEs (with addresses f and g) if and only if either of the following tao conditions hokt 
& me stieger part ot (ei r e2) equate tl» integer part of s^e v e2) and: 

1. the remainder part of rtel defers from the remainder pan of s/ei by 1 or, 

2. the remainder part of rfe2 affers from the remainder pari of s/e2 by 1 or e2-l. 

b, me remainder parrot r/ei dffito from the remainder part of e/e* for I m me range 3. 4. m and the 

0 remainder part of r/el equals t» remainder part of &fe2 which equals i minus three, and the 

remainder part of rVe2 differs from the remainder pert of s/e2 by e2 minus one. 
As a result the computing system nodes wlft form a non-binary hypercube, wfth the potenfal for being 
different radix in each dimension. The node is defined as an array of PMEs which supports 2*n ports such 
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that the ports provided by nodes match the dfenensionalty requirements of the modified hypercube, & the 
set or integers 03. e4. em. which define me speefflc extent of each cSmenston of a pottcular mocfifled 
hypercube are e0 taken as equal, say b, and If 01 and & are taken a 1, then the previous formulas for 
addressability and connections reduce to: 
5 1.N=b'*n 

2. addressing a PME as numbers representing die base b numbering system 

3. cormecdng two computing elements (f and g] » and only if me address of f differs from the address of 
g in exactly one base b digit using the rule *e*0 and b-1 are separated by 1. 

4. fte number of connections supported by each PME te 2*n 

ro Which is exactly aa deacribed in the base application, with tfis number of communication buaes spanning 
nor>edfecem PMEs chosen as aera 

totra-ftodo PME susroonneofjons: 

;s PMEs ere configured wfthin the node aa a 2 by n array. Each PME Is interconnected vrffh fes three 
neighbors (edges wrap onty in the second dimension) using a eat of input/output ports, thus, providing fid 
duplex corrvnurncatfon capability between PMEs. Each PMEs external input and output port is connected to 
node M} pin*. Input end output ports may be connected to share pins for half-duplex communication or to 
separate ptns tor full-duplex capability. The Interconnections far a 4d modified hypercube node are shown 

so in figures 9 and 10. (Note that where n is even tie node can be considered tobea2by2byrW2 array.) 
RGURE 9 i&istrares the the eight prDcesatng elements 500, 510. 520, $30, 540, 550, 500. 570 within 
the node. The PMEs are connected in a binary hypercube communrcattan network. This binary hypercube 
displays three intra connecfions between PMEs (501, 51 1 521, 531, $41, 551. 561, 571, 500, 501, 592, $83,. 
Communication between tfre PME Is controlled by in and out registers under control of Ihe processing 

£0 element This diagram shows the various paths that can be taken to escape K> out any of the eight 
directions. +-w 525. 565. +-x 515. 555. +-y 505. 545. +-z 535, 575. The rxmmuntancn can be 
accomplished without storing the data Into memory if desired. 

It may be noted that whfie a network switch chip could be employed to connect various cards each 
having our chip with other chips of the system, r can and should dearatfy be eliminated. Our imsr PME 

3D network that we describe as (ha "4d torus" is the mecharesm used for Irrtar PMEHX5rnmunication. A PME 
can reach any other PME m the array on this interface. (PMEs In between may be Stcce*Forward or Circuit 
Switched) 

Chip Reiattonsmps for [ntggonnecjona 

ss 

We have discussed lite chip, and FIGURE ii shows a block diagram of the PME ProcesaotfMemocy 
chip. The chip consist of me following eternal each of which wffl be descrfoed in later paragraphs 

1 . a PMEs each consisting of a 'Id bit programmable processor and 32K words of memory (84K bytes]*, 

2. Broadcast Interface (BO) to permit a controller to operate all or subsets of the PMEs end to 
40 accumulate PME requests. 

3. Interconnection Levels 

a. Each PME supports four Q m wide tnter-PME communication paths. These connect to 3 
neighboring PMEs on the chtp and 1 off chip PME. 

b. BraadcasHc-PME boring, which mates data or instructions available. 

4$ c. Service Request lines that permit any pme to send a code to the oontronet- The bci combines the 
requests and forwards a summary. 

d- Serial Service loops permit Ihe controller to read aO detail about the functional blocks. Thia level of 
interconnection exiends from the BCI to all PMEs (FIGURE 1 1 lor ease of presentation omits this 
detail.) 

so 

Interconnection and Routing. 

The mpp wui be implemented by repftcaion of the pme. The degree of repBcatton does not affect the 
interwwection and routing schemes used. RGURE 6 provides an overvfew of the neiworfc interconnection 
55 scheme. The chip contains £ PMEs with lnterccnnec6ons to their immediate neighbors- This interconnection 
pattern results in the three dimensional cube structure shown in FIGURE 10. Each off the processors within 
me cube has a dedicated bidirectional byte pori to the chip's pins; we refer to the set of 0 PMEs as a node, 
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An n by n array of nocks is o duster. Simple bridging between me * and - x porta and the + and - y 
port* provide the duster node tatercomec&ona. Hera ttvo our preferred cNp or node has 8 PMEa, each of 
which manage a stogie external port This distributes Hie network control function and efimtnates a possible 
bottleneck for ports. Bridging the outer edges makes the duster into a logical torus. We have considered 
s dusters with n=4 and n«8 and believe that n = B Is the better solution tor commercial applications while 
n-4 te better for miliary conduction cooled aptfteaSons, Our ooncept does not Impose an unchengeaDIa 
duster size. On the contrary, we antidpete some applications using variations. 

An array of dusters results En the 4 <*mensionel torus or hypercube structure fflustrated In FIGURE io. 
BrWojng between the + and - w ports and + and - z ports provides the 4d torus interconnection*. This 
w results in each node within a cluster connected to an equivalent node in al adjacent dusters. (This provides 
04 ports between two adjacent dusters rather than the a ports that would result from larger dusters.} As 
svfih the duster size, the scheme does not imply a particular size array. We have considered 2x1 arrays 
desirable tor workstations and MIL applications and 4x4. 4x8 and 0x8 arrays tor mainframe appications. 

Developing an away of 4d toruses Is beyond the gate,, pm, and connector limitations of our current 
is preferred dtp. However, thai limitation disappears with our efemauVe on-chip optical diiver/raceiver Is 
employed. In Hits embodiment our network could use an optical path per PISE; wtti 12 rather than 8 PMEa 
per chip the array of 4d toruses with muH-Tflop (Teraflop) performance end It seems to be economical y 
feesfole to make such machines avatfabls tor the workstation environment Remember that such aftemafive 
machines wtll use the application programs developed for our current preferred embodiment 

4d duster Organization 

For constructing a 4d mooted nypercube 360, as ibstrated in FIGURES 8 end 10, nodes supporting 8 
external ports 315 ere required Consider fee external ports to be labeled as +X, +Y. +2, +W, -X. -Y, -2. 

29 -W. Then using rm. nodes, a ring can be constructed by connecting the +X to -X ports* Again ir* such 
rings can be interconnected into a ring of rings by interconnecting me matching * Yto -Y pons. This level 
ot structure will be called a duster 320, With mi = m 2 ~3 it provides for 512 PMCe and such a cluster will 
be a ouadtog block for several size systems <330, 340. 3501, as IBustresed with m=8 In FIGURE 

30 4d Array Organization 

For buWmg large fine-grained systems, sets of m 3 clusters are interconnected In rows using the +2 to 
-Z ports. The rru rows are then totercormected using the 4 W to -W ports. For m t ~.<jtu ~8 this results In 
system with 32768 or pmes- The organization does not require mat every dimension be equal y 

39 populated as shown in FIGURE 6 (large foe-grained parallel processor 370). in ere case of the fine-grained 
sma! processor, only a duster might be used wUh the unused Z and W ports being Interconnected on the 
card. This tecnnkiue saves card connector pins and makes possible the application ot this ecateUe 
processor to workstations 340, 350 and avionics applications 330, both of which are connector pro smiled. 
Connecting *h ports together to the Z and W pairs leads to a workstation organization that permits debug. 

4? test and large machine software development 

Again, much smaller scale verstons of the structure can be developed by generating the structure with a 
value smaller man m = 8, Tms wtt permit construction of single card processors compatible with the 
requirements for accelerators in the PS/2 or RtSC System 6X300 workstation 350. 

<* IK? Performance 

I/O performance includes overhead to setup transfers and actual burst rate data movement Se&jp 
overhead depends upon application function ¥0 complexity and network contention. For example, an 
application can program circuit switched traffic wfth buttering to resolve conflicts or it can have all PMEs 
so transmit left and synchronize. In tie first case. I/O is a major task and detailed analysis would be used to 
size it We estimate mat simple case setup overhead Is 20 to 30 dock cycles or £ to 1 2 u-eec. 

Burst rate 10 Is the maximum rate a PMG can transfer das (with either an on or off chip neighbor.) 
Memory access Emits set the data rate at 140 nsec per byte! corresponding to 7.14 Mbyte's. This 
performance includes buffer address and count processing pfos data maovwrite. It uses sever) 40ns cycles 
ss per 15 bit word transferred. 

This burst rote performance corresponds to a cluster having a maximum potential transfer rets of 3#5 
©bytes'*. & also means that a set of eight nodes along a row or column ot ft* duster will achieve 67 
Mbyte/a burst date rate using one set of their fl available ports. This number cs significant because W> with 
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the external works wfl b« done by logically 'unzipping 1 an edge of the wrapped duster and attaching it to 
the external system bus. 

inter-PME Routing Protocol 

Tna SIMD/MIMD PWE comprises ^erprocedsor communication to the externa) VO fftcffiaes* bioadcest 
corraroi imerfacas, and switching features which a tow both SIMD/WIMD operation whhln the same PME. 
Embedded in Hie PME la the fully distributed programmable I/O router for processor communication and 
data transfers between PWEe, 

?o The PMEs have folly distributed mterprocessor communication hardware to on-chip PMEs as well as to 
the external I/O feeffitks which connect to the interconnected PMEs to the modified hypercubo configura- 
tion- This hardware is com fomented with the flexible prcgjemn^bjliiy of the PME to control the VO activity 
via software. The programmable VO router funcfcns provide for generating data packets and packet 
addresses. With this Information the pme can send the information mm the network of PMEs in a directed 

;s method or out multiple paths determined by any fault tolerance rBquirsments. 

Distributed feufe tolerano© algorithms or program algorithms can take advantage of the programmabaty 
along wHh the supported Car curt switched modes of the PME. This performance combinational mode 
enables everything from off-line PMEs or optimal path data Efructuros to be acoomplshed via the 
programmable I/O router. 

so Our study of sppnesttons reveals that It is sometimes most efficient to send bsro data between PMEs. 
At other times applications require data and routing informatfen- Further, it is sometimes possibfe to p(an 
communicahons so (hat network conflicts cannot occur: other applications oiler the potential for deadlock* 
unless mechanisms for buffering messages at intermediate nodes are provided. Two examples fflustrate trie 
extremes. In the relaxation phase of a POE solution, each grid point can be allocated to a nod*. The inner 

£9 loop process of acquiring data from a neighbor can ossify be synchronized over a* nodes. Alternatively, 
image transformations use local data parameters to determme communication target or source tttentfters. 
This results in data moves through mutepfe PMEs, and each PME becomes involved In the routing task lor 
each packet. Preplanning such traffic is generally not possible. 

To enable the network to be efficient lor all types of Master requirements, we partrdon, between the 

30 H/W and the responsibility tor data routing between PMEs, SW does most of (he task sequencing 
function. We added special features to the hardware (WW) to do the inner loop transfers and minimize 
software (©7W) overhead on the outer loops, 

I/O programs at dedicated interrupt levels manage- the network. For most applications, a PME dedicates 
four interrupt levels to receiving data from the four neighbors. We open a buffer a? each level by loading 

35 registers at the levei and executing the DM (rr uses buffer address and transfer count but does not await 
data receipt) and FSTVRN instruction pair. Hardware then accepts words from the particular Input bus end 
stores them to the buffer. The Duffer tun ccndfSon will men generate me interrupt and restore ft* program 
counter to Che instruction after the RETURN, This approach Is Interrupt levels permits LO programs id be 
written that do not need to test what caused fte Interrupt Programs reed data, return, and then continue 

40 directly into processing the data mey road. Transfer overhead is minimized as most sHuasone require Bfte 
or no register saving. Where an application uses syrKbror&ration on 1/0, as In the PDE example, then 
programs can be used to provide frat capability. 

Write operations can be started m several ways. For the PDE mampie, fit the point where a result Is to 
be sent to a neighbor, the appEcation level program executes a write calL The caJI provides buffer location, 

<s word count and target address. The write subroutine includes the register loads and OUT instructions 
needed to Wttets die HW and return to the application. H W does the actual byte by byte data transfer. 
More complicated output requirements win use an output service function at the highest interrupt k*eJ. Both 
application and interrupt level taste access that service via a soft interrupt. 

8etting up circuit switched paths builds on these slrnpis read and write operations. We start wBh all 

30 PMEs having open buffers sized to accept packet headers but not the data. A PME needing to send data 
instates the transfer by sending an address/data block to a neighboring pme whose address baiter rnatches 
the target In 4ie rwrghbering PME address information will oe stored; due to suffer fuD an interrupt wul 
occur. The Interrupt SArV tests (he target address and will either extend Ae buffer Do accept the data or write 
fts target address to an output port and set the CTL register for transparent data movement. (This allows 

55 the PME to overlap its application executions with the circuit switched bridging operation.) The CTL register 
goes to busy state and remains transparent until reset by the presence of a signal at end of data stream or 
abnormally by PME programming. Any nurriber of variations on these themes can be implemented. 
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System IfQand Array Orrsctor 

FIGURE 12 shows an Array Director tn the preferred •mbocfimenJ, which rn*y perform the functions of 
the controller ot FIGURE 1 3 which descrflx» the system bus to array connexions. FIGURE 13 b composed 
s or two pats, (a) (ha bus to/from a cluster, ml part (b) iie communication of Information on the bus taffrom 
a PME Loading or unloading the arsy Id done by connecting the edges of clusters so the system bus. 
Multiple sysrom buses can be supported with muWpte clusters. Each duster supports 50 to 57 Mbyte's 
bancfortdm. Loading or unloecfiog the parallel array requires moving data between an or a subset of the 
PMEe and standard buses (te MicroChannel, VMfcVbus, FutureBus, etc). Those buses, part of She heel 
in processor or array conftoflar. are assumed to be rigkdry specified. The PME Array therefore must be 
adapted te ma buses- The pme Array can be matched to the bancwdm of any bus by interleaving bus data 
onto n PMEs, with n picked to permit PMEs both I/O and processing Sme, FIGURE 1 3 shows how we mighi 
connect ihe system buses to the PMEs at two edges of a cluster. Such an approach would permit 114 
Mbyte/3 to be supported, it also permits data to be loaded at harf the peak r«e to two edges Simula- 
's neousry. Although this reduces tho bandwicft to 57 Mbyte's/cluster, rt has me advantage of providing 
orthogonal data movement within fee array and ability to pass data between two buses. (We use those 
advantages to provide fast transpose and matrix muHpry operation.) 

Ae shown h part (a) of FIGURE 13, the bus "dots to all paths on the edges ot the cluster; and, the 
controller generates a gate algnd Id each paih in the required rnrerteaye timing. If required to connect to a 
2> system bus wsh greater then 57 Mbyte's, the date win be interleaved over multiple cluster*. For example, in 
a system requiring 200 Mbyte's system buses, groups of Z or 4 dusters wM be used. A large MPP has the 
capacity to attach ie or 94 such buses to its xy network paths. By using the w end z paths in eddHion to the 
x and y paths, that number could be doubled. 

FIGURE 13 part (b) shows how the data routes to individual PMEs. Tha FIGURE shows one particular 

29 w«x,y or z path that can be operated at 7.13 Kfcytefe in burst mods, if the data on the system bus occurred 
in bursts, and 9 the PME memory could contain ihe complete burst then only one PME would be nxjured. 
We designed the PME I/O structure to require neither of these conditions. Data can be gated into the 
PMBcO at the fen rate until buffet full occurs. At that Instant PMBtO wiM change to transparent and PMBci 
wi begin accepting the data, Wfchin PMExO processing of the input data buffer can begat PMEa ftat have 

30 taken dala and processed H are limited because they cannot transmit the results white in flie trartepareni 
mode. The design resolves this by svritchrng the data stream to tie opposite end of the pern at intervals. 
FIGURE 13(b) shows that under SW control one might dedicate PMExO through PMExd to accepting data 
while PME&12 through PMEx15 unload results and itea-versa. The contrdter counts words and adds end of 
block signals to the date stream, causing me ewfctch to cfrecSoa One count applies to all paths supported 

33 by the controller so controller workload is reasonable. 

SYSTEMS FOR ALTERNATIVE COMPUTERS 

FIGURE 16 Illustrates a system block diagram for a host attached large system with a single application 

40 processor interface (APfj, This Ifostrafton may ateo be viewed with the understanding tha? our rnvenlten may 
be employed in stand alone system which use multiple appicabon processor interfaces (not shown* This 
configuration will support DASDYGrenpJca on all or many clusters. Workstation accelerators can eliminate the 
host, appilcsfion processor Interface (API> and cluster s\mchronber (CS> Illustrated try emulstton, Tha C$ 
not always required. H wi8 depend on Ihe type of processing feat is being performed, as wel as the 

49 physical drive or power provided for a particular application which uses our invention. An appScaflon this is 
doing mostly MIMD processing will not place as high a workload demand on the controler, so here the 
control bus can see very stew pulse rise times. Conversely, system doing mossy aayrclironous A-SIMD 
opereaons with many independent groupings may require faster control busing, in this case, a cluster 
synchronizer will be desirable. 

so The system block otegram of FIGURE 1 8 Illustrates that a system might conitet of host array comroilar 
and PME array. The PME array is a set ot clusters supported by a set ot duster controllers (CC). Although 
a CC is shown for each cluster that relationship is not seictry required. The actual ratio of clusters to CCb Is 
flexible. The CC provides redrtve to> end accumulation from the 64 BCIsfckisters. Therefore, physical 
parameters can oe used estabttsh the maximum redo. Addieonafly, the CC win provide tor controlling 

53 multiple independent subsets of the PME array; that service might also become a gating requirement A 
study can be made to determine these requirements tor any particular application ot our invention. Two 
versions ot the CC wfR bs used, A duster that te to be connected to a system bus requires the CC providing 
interleave controls (see System I/O and FIGURE IS) and ai-state drivers. A more simple version thai omits 
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the tri*6tefe busing features can also be employed. In the cast of targe systems, a second stage of redrive 
and accumulation la added. TWo level la me cluster aynchronlzsr rC£). lha eat of COb plus CS and the 
Appfica^on Processor Interlace (API) make up the Array Controller. Only the API is a programmable unU 

8everal variations of this system synthesis scheme will ba used. These result In different hardware 
3 configurations lor various appttcaHons, but Ehey do not have a major Impact on the supporting software. 

For a workstation accelerator, the duster comroflare win be attached directly to the workstation system 
bus; the function of the API wtfl be performed by the workstation. In the case of a RI8O8000, iha system 
bus la a Micro Channel end the CO units can plug cflrecify into the slots within the workstation. This 
configuration places the VO devices (DASO, SCSI and display Interfaces) on the same bus that 
ro kiadsAin toads the array. As such the parallel array can be used Cor I/O intensive taska like raaJ time image 
generation or processing. For workstations using other bus systems {VME-bus, FutureBua a gateway 
interface wMI be used. Such modules are readily avauabte in the commercial marketplace, Note thai in these 
minimal scale systems a single CC can be shared between a determined number or dusters, and nefifter a 
OS nor an API Is needed. 

ts A MIL avionics application might be simitar in sue to a workstation, but it needs different interfering. 
Consider what may become fte normal mIOlary situation. An etfsting platform must be enhanced with 
actional processing capability, but funding prohibits a complete processing system redesign. For this we 
would attach to the APAP array a smart memory coprocessor, to Are case, a special application program 
interface API that appears to the host as rnemory will be provided. Data addressed to the host's memory 

so wD then be moved to the array vie CC(s). Subsequent writes to mamory can be delected and interpreted as 
commands by ihe API so that the accelerator appears to be a memory mapped coprocessor. 

Large system* can be developed as either host attached or as stand alone configurations. For a host 
attached system, me configuration shown in RGURE 18 Is useful. The host will be responsible for UO, and 
the API would serve as a dispatched task manager. However, a large stand alone system is also posstbte in 

£$ special situations. For example, a database search system might efirntnat© the host, attach 0A$D to the 
Microcmanneis of every cluster and use multiple APIs as bus masters slaved to me PMEs. 

Zfrper Array Interface wHh Bqemal I/O 

so Our zipper provrttes a fast VO connection scheme and is accompSshed by placing a switch between two 
nodes of the array. Tills switch win allow for the parallel communication into and out of the array. The fast 
10 witt be Implemented along one edge of fine array rings and acts like a targe zipper into the X. Y, W, Z 
rings. The name "zipper connection* is given to the fast W>. Allowing data to be transferred into and out of 
the network while only adding switch delays to transfer the data between processors is a unique loading 

ss technique. The switching scheme does not disrupt the ring topology created by the X, Y, W, Z buses and 
special support hardware sQows the ripper operation to occur while the PE is processing or routing data 

The ability to bring data into and out of a massively p&altel system rapidly is en important 
enhancement to the performance of the overall system, We believe lhat the way we implement our fast 00 
without reducing the number of processors or dimension of the array network is unique in the field of 

40 massfvsty per&jlel environments. 

The modjfied hypercube anrangement oan be extended to permit a topology which comprises rings 
within rings. To support the interface to the external I/O any or all of the rings can be logically broken. Tne 
two ends of the broken ring can than be connected to external I/O buses. Breaking the rings is a logical 
operation so as to permit regular inter-PME communication at certain time inifirvata while permitting MO at 

<s other «me intervals. This process of breaking a level of rings wrthtn the mocWted hypercube effectively 
♦unzips 1 rings for VO purposes. The fast MO "zipper 11 provides e separata interface Into the array. This ripper 
may exist on f to n edges of the modified hypercube and could support either paraflel input into mumpte 
dimensions of rjie array or broadcast to multiple dimensions of the array. Further data transfers into or out 
of the array could alternate between the two nodes dlrecify attached to the zipper. This I/O approach b 

ss unique and it permits developing different zipper sizes to satisfy particular application requirements. For 
example, in the particular corrhguretion shown m FIGURE B, called ma large toe-grained processor 3B» the 
zipper for the Z and W buses will be dotted onto ihe MCA bus. This approach optimizes the maxnx 
transposition time, satisfying a particular application requirement for the processor. For a more detailed 
explanation of the "zipper" structure, reference may be had to the APAP VO ZIPPER application filed filed 

ss concurrently herewith. The zipper fo here illustrated by Figure U. 

Depending on the configuration and the need of the program to roll data and program into and out of 
§re individual processing elements, m else 02 the zipper can be varied. The actual speed of the MO zipper 
is approximately the number of rings attached limes Ihe PWE bus width, femes the PME dock rate ail 
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divided by 2, (The dVwion permas the receiving PME tone to move data onward. Since it cen send it to any 
of n pteose I/O contention to completely absorbed over the Array.) With easting technology, te^ S MBftec 
PME transfe* rate, 04 ring* on the* zipper, and interieaved to two node? transfer*, 320UB/$ec Array transfer 
rates are possible- (See tho typical zipper ronaguretion in FIGURE 14). FIGURE 1* illustrate* the test W) or 
s the so-called 'zipper connection' TOO, 710 which exists as a separate interlace Into trie array. This zipper 
may exist on one 700 oi two edges 700, 710 of the hyper cube network by doffing onto the broadcast bus 
720, 730, 740, 750, at multiple nodes in the array 751. 752, 753, 754 and In mutfple directions 770, 780, 
790*751. 755,767. 

Today 4 * MCA bus supports 60 to 160 MB per second buret transfer rate and therefore Is a good match 
io for a single zipper in simple or non-interleaved mode. The actual transfer rate given channel overhead and 
efficiency t$ something leas than that For systems that have even more demanding VO reqUreunento. 
multiple zippers and MCA. buses can be utilized These techniques are seen to be important to processors 
that would support a largo external storage associated wtih nodes or dusters, as might bo characteristic of 
database machines- Such VQ grow* capability Is comptefeJy unique to this machine and has not previously 
*5 been incorporated in either massively perauel, convention^ single processor, or coarse-gained parallel 
machines. 

Array Orrector Architecture 



to CHr massively parade! system & made up of nodat building blocks of multiprocess** nodes, dusters of 
nodes, and arrays or PMEs already packaged in dusters, For control or these packaged systems we 
provtcte a system array director which with the hardware controllers performs tlie overall Processing 
Memory Element (PME) Array Controller 1 functions in the massively parebet processing environment. The 
Director comprises of three functional areas, the Application Interlace. She Cluster Synchronizer, end 

29 normaHy a Clusta Controller. The Array Director wfil ha%-e fie overall control of the PME array, using the 
broadcast bus and our zipper connection to steer data and commands to ail of the PMEs. The Array 
Director functions as a software system interacting with the hardware to perform the rote as (he shell of the 
operating system. The Array Director In performing this role receives commands from the eppScatkm 
interface and issuing the appropriate array instructions end hardware sequences to accomplish the 

30 designated task. The Array Dfcector*B main function Is to continuously feed the tnstrucuons to the PMEs and 
route data in optimal sequences io keep the frame at a maximum and collisions to a minimum. 

The APAP computer system shown In FIGURE 6 fs iHus&ated m more detail m the diagram of FIGURE 
12 Yvhlch Illustrates the Array Director which can function as a controfler, or array controller, as ftistrsted In 
FIGURE 13 and FIGURES 16 and 19. This Anay Director 610 Bfuatnrted in FIGURE 12 is shown in the 

35 preferred embotfment of an APAP in a typical configuration of n Identfcat array dusters 505, 670, 680, 690, 
with an array director 610 tor the cluster* of 512 PMEs, and an application processor interface 930 for the 
application processor or processors 600. Tlie synchronizer 650 provides the needed sequences to tte array 
or duster eontroler 640 and together tfrcry make up the "Array Director" 610. The application processor 
interface 690 will provide the support tor the host processor 600 or processors and test/debug workstations. 

49 For APAP unto attached to one or more hosts, the Array Director serves as the interface between the user 
and the array of PMEs. For APAPs functioning as stand alone parallel processing machines, the Array 
Director becomes the host una and accordingly becomes involved In unit \fO activities. 

The Array Director will consist of the following tour functional areas: {see fte functional block diagram In 

FIGURE 1 2) 

49 1, Appftceflon Processor Interface (API) 600, 

2, Ouster Synchronizer (CS) 650 (0x8 array of dusters}, 

3, Ouster tortroOer fCC) 640 {& x I array of nodes). 

4, Fast VO (zipper Connection) 630* 

so The Application Processor Interface (API) 630: 

When operating in attached modes, one API will be used for each hosL That API wttl monrtor the 
incoming data stream to determine what ere Instructions to fte Array dusters 695. 670, 600, 690 and what 
are data for the Fast vo (zipper) 620. When in standalone mode. M API serves as the primary user 
89 program host 

To support these venous recrements, the apis contain the only processors wttnm the Array Director, 
plus the dedicated storage for the API program and comrrwds. Instructions received from the host can call 
for execution of API subroutines, loading of API memory with soYJtfensJ functions, or for loading of CC and 
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PME memory with now aw. As described in me 3W overview section, these various type requests can be 
resided to subset of users via the Initial programs loaded Into ihe API. Thus, the operating program 
loaded wai determine the type of support provided which can be tailored to match the performance 
capability of the API This torther permits the APAP to be adjusted to ihe needs of multiple users requiring 
3 managed and wbO tested services, or to the IndhrirJuaJ user wishing to obtain peak performance on a 
particular application. 

The API also provides for managing the path to and from the UO zipper. Data received from the host 
system In attached modes, or from devices in standalone modes is forwarded to me Array. Prior to initiating 
thfe type of operation the PMEs within the Array wwch wffl be managing the K> are Initiated. PME* 

10 operating in MIMD mode can ulifize the last interrupt cepabi&ty and either standard SAtf or special fanctbns 
tor this transfer whse those operating in SIMD modes would have to be provided detailed control 
instructions. Data being sent from the 10 zipper requires somewhat the opposite conditioning- PMEs 
operating in MIMD modes must signal ihe API via file high speed serial inierEace and await a response from 
the An, wrtfe PMEs to SIMD modes are already In synchronization with the API and can *e»efc»e 

S3 immsxfatory output data. The ability to system switch between modes provides a unique ability to adjust the 
program to the application. 

Cluster gyncfaronfeer (08) 650 



so The CS 650 provides fte bridge between the API 630 end CC 640- ft stores API 930 output in FIFO 
stacks and monitors the status being returned from me CO 650 (both parallel input acknowledges and high 
speed serial bus data) to provide trie CC, in timely fashion, wtth the desired routined or operations mat need 
to be storied. The CS provides the capacity to support different CCs and dffierent PMEs within clusters so 
as to permit dividing the array into subsets. This is done by partitioning the array and men commending the 

£s Involved cluster controllers to selectively forward the desired operation. The primary function of the 
synchronizer is to keep ail dusters operating and organized such that overhead erne fs minimized or burled 
under the covers of PM6 execution time We have described how ihe use oi the cluster synchronizer in A- 
SlMD ccoflgurations is especially desireble- 

so Cluster Corrtroaar (CC) 640 

The CC 840 interfaces to the node Broadcast and Control Interface (BCI) 605 for the set of nodes in en 
array cluster 635, (For a 4d modified rrypercube with 8 nodes per ring that means toe CC 340 Is attached to 
04 BCb 605 to an 8 by 8 array of nodes and is controlling 512 PMEs. Sixty-tour such dusters, also In a 8 

3s by 8 array, lead to the ful up system with 32788 PMEs,) The CC 840 wU send commands and dato 
suppled by toe CS 660 to the BCI parallel port and return the acktxi^edgement data to me CS 850 when 
operating in Ml MO modes, m SI MO mode the interface operates synchronously, and step by step 
acknowledgments are not required. The CC 840 also manages and monitors the high speed serial port to 
determine when PMEs within the nodes are requesting services. Such requests are passed upward to the 

40 CS 880 while the raw data from the high speed serial interlace ie made ewsifeble to the e&tus cfeptey 
interface. The CC 640 provides the CS 650 wish an interface to specific nodes wfthm the cluster via the 
standard speed seiial interface. 

In SIMD mode the CC wfli be directed to send Instructions or addresses to all the PMEs over me 
broadcast bus. The CC can dispatch 18 bit inBlmctun to all PMEs every 40 nanoseconds when in SIMD 

*s mode. By broadcasting groups of native instructions to the pme, the emulated instruction set is formed. 

When in MiMO mode toe CC will waft for die endbp signet before issuing new instruction to the PMEs, 
The concept of the NfflMD mode is to build airings of micro-routines wan nafre Instructions resident In the 
PUE. These strings can be grouped together to form the emulated fesfruciions, and these emubted 
instruction can be combined to produce service canned routines or library functions. 

se When In SIMD/MDrtD (SIMIMD) mode, the CC will Issue instruction as If In SIMD mode and check for 
endop signals from certain PME a. The PMEs mat are to MIMD wffl not respond to the broadcast instructions 
and will continue witti there designated operation. The unique status indicators will help the CC to manage 
this operation and determine when and to whom to present the sequential Instructions. 

35 OperetBonal Software Levels 

Th* application overviews the operational software S/W levels to provide further explanation of toe 
services performed by various hardware WW components. 
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Computer systems generally used hew* an operating system- Operating system feamete which are 
Felatfwtfy complete must bo provided to most massive MIWD machines, whom workstation ctaae CPU chips 
run kernel* such as Mach. The operating system kem*l $nppcnl$ message passing or memory cohewcy . 
Other massively parallel systems bared upon SIMD rnodete have almost no intellgenc© to the array. There 
s are no •program counters" out in the array, and thus no programs to execute tocafy. Afl instructoa are 
broadcast 

si the systems we have provided with our PME as the basis for duner arrays, there b not need tor an 
operaang system at each cttp, a node. We provide a Horary of bey function* for computation andtor 
oomnuMon wfthto each PE (PME> that can be invoked at a high level SUvttMIke Instructor* are 

io broadcast to Ihe array to set each of a selected sel of PMEs. These PMEs can then perform m te\ MIMD 
mode one or more of these library routines. In erjdtion. basic in&rupt hantHei and communteoSons routines 
are resident In each PME allowing the PME to handle communication en a dynamic basis. Unite existing 
MMD machines, (he APAP structure need not include an entire program in PME memory. Instead ai of [hat 
code, which is essentfaiy serial, Is the cluster oontroRer. Thus such code, 90% by space and 10% by time 

is (typically) can be broadcast in a SIMD teshio to an array of PMEs. Only the truly parallel inner loops are 
distributed to lha PMEs in a dynamic fashion. These are then initialed into MIMD mode fust as otier 
p Rbrary" routines are. This enables use of program models which are Single Program MuBp4e data to be 
used where the same program Is toacted in each PME node, witt embedded synchronization cede, and 
executed at the local PME, Design parameters affect bandwidth available on different links, and the system 

20 paths are ^grammatically configurable, allowing high bandwfth finks on a target network, and allowing 
dynamic partition of oil chip Kke PME-to-PME links to provide more bandwidth on specific panrhs as meets 
the needs of a particular application. The finks leaving a chip mete directly with each other, without the need 
tor extemaJ logic There are sufficient links and there Is no predesigned constraint as to which other ants 
they con etlach to, so (hat the system can have a diversity of miercormaciion topologies, wift muling 

29 performed dynamically and programmetry. 

The system allows usage of existing compilers and parser* to create an executable paraisJ program 
which could run on a host or workstation based configuration Sequential source code for a Single Program 
MuJBpie Data system would pass through program analysis, for examination of dependency, data and 
controls, enabling extension of program source to include can graphs, dependency tables, atetses, usage 

30 tables and the 1*0. 

Themfter, program transformation would oocm wnereby a modified version of the program would be 
created that extends the degrees of paraltetsm by combining sequences or recognizing patterns to 
generate expScIt compter directives, A next step would be a date allocation end partitioning step, with 
message peneratkm k which would analyze data usage pattemsnd allocate so frat elements to be combined 

33 would share common indexing, addressing pattern, and these would provide embedded program compiler 
directives and calls to communication services. At this point the program would pass to a level parfflcring 
step, A tevel partitioning step would separate the program info persons tor execution to ARRAY, in ARRAY 
CONTROLLER (array director or duster controller), and HOST, Array portions would be interleaved in 
sections with any required message passing synchronization functions. At thte point level processing could 

40 proceed. Host sources would pass to a level compter (FORTRAN) tor assembly compilation. Contrtfter 
sources would past to a microprocessor corrtrofer compiler, and Items thai would be needed by a single 
PME and not available in a library caS would pass to a parser (FORTRAN OR 0) to an intermediate level 
language representaiton which would generate optimized PME code and Array Controller code. PME code 
would be created at PME machine level and would include library extensions, which would pass on toad 

* into a PME memory. During execution a PME parallel program in me SPMD process of execution could call 
upon already cods d assembly service Functions from a runtime library kernel. 

Since the APAP can function as either an attached unit that m cJosely or loosely coupled wllh Its heat or 
as a stand atone processor, some variation in the upper level S/W models exists. However, these variations 
carve to integrate the various type applications so m to permit a single set of tower level functions to satisfy 

so all three applications. The explanation wiH address tie attached version S/W first and then the modifications 
required tor standatone modes. 

si any system, as frustrated by FIGURE 10, where the APAP is intended to adach to a host processor 
the user's primary program would exist within the host and would delegate to (he APAP unit tasks and 
associated date as needed to prevkte desired toad balancing. The choice of interpreting the dispatched 

S3 task's program within ttw host or the Array Director is a user option. Host level intenxetation permits the 
Array Director to work at interleaving users which do not exploit close control of the Array, while APAP 
interpretation toads to minimal latency In control branching but tends to limit the APAP time to perform 
muffi-user management tasks. This leade to the concept ihat the APAP and host can be tightly or loosely 
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coupled. 

Two examples ftustrate the ex Hemes; 

i. When APAP is <ctisa*e<l to 3090 class nuxtirm with Floating Point Vector FaciEfes. user dale in 
compressed form could be stored within the APAP. A host program thai called for a veotor operation 

6 upon two vectors with differing sparseness ctaracterlstrca would then send instructions to Ihe APAP to 
realign the data into element by element matching pajra, output the result to m vector Facility, read 
answer from the Veotor Fadirty end finally reconfigure data into final sparse data form. Segments of tne 
APAP would be Interpreting and bulking sparse matri* bit maps, while ofter sections would be 
calculating how to move data between PMEa such mat ft would be property aligned for the zipper, 

io Z. With APAP attached to a small inflight military computer, the APAP could be performing the entire 
wortdoed associated with Sensor Fusion Processing. The host might initiate the process once, send 
sensor data as B was received to the APAP and then wan tor results. Trie May Director would then have 
to schedule and sequence the PME array through perhaps dozens of processing steps required to 
perform the process. 

/s The APAP will support tores levels of user controfc 

1. Casual User. Sme works with suppSed routines and library function. These routines are maintained at 
fre host or API level and can be evoked by dm user via subroutine calls within Ms program. 

2. Customfcer User, S/fce can write specter Junctions which operas? within the API and which dtrectty 
evoke routines supplied with the API or services supplied with the CC or PME. 

50 3. Development User. Sfts generates programs for execution hi the CC or PME. depending upon API 
services far program load and status feedback. 

Satisfying these ttree user levels in either closely or loosely coupled systems leads to Die partitioning 
of HW control tasks. 

as API Software Tasks 

The application program interlace API contains 3W services that can tost (he leading words of data 
received and can determine whether that data should be interpreted by the API loaded to some storage 
within ihe Array Director or PME, or passed to the I/O zipper. 

so For data that is to be Interpreted, (he API determines the required operation and invokes Ae function. 
The most common type operation would can tor *e Array to perform some function which would be 
executed as a result of API write* to the CS (and indirectly to the CC), The actual data written to the CS^CC 
would In genera) be constructed by tha API operational routine based upon the parameters passed to the 
API horn the host. Data sent to the CS/CC would in torn be forwarded to the PMEs vis the node BCL 

3s Data could be loaded to etiher API storage. CC storage, or PME memory. Further, dara to be loaded to 
PME memory could be loaded vie either the I/O zipper or via the node BCL For data to be put into the API 
memory, the incoming bus would be read then written to storage. Data targeted fo the CC memory would 
be similarly read and then be written to the CC memory, Finalry. data for the PME memory (in ftie case 
normaly ne* oi additional M1MD programs) could be sent to all or selected PMEs via the C&CC/Node Bfl 

40 or ic a subset o$ PMEe tor sotective meitetrlbufien via the I/O zipper. 

When data is to be sent to the UO zipper, it could be preceded by inline commands that permit the 
PME UUMD programs to determine its ultimate target or, it could be preceded by calls to the API service 
functions to perform sifter MIMD Initiation or SIMD transmission. 

In addition to responding to requests lor service received via the host interface, the API program wrU 

« respond to reouest from the PMEs* Such requests wfii be generated on the high speed serial port and wffl 
be routed through the CC.'CS combination. Requests of this sort can result in the API program's dfrecfly 
servicing Ihe PMEs or accessing As PMEs via Ihe standard speed serial port to determine further qualifying 
data relative to the service request* 

so PME Software 

The software plan includes; 

o Generation of PME resident service routines (that Is, 'an extended f$A T ) tor complex operations and 
i/O management 

ss o Definition and development of controller executed subroutines (hat produce and peso control and 
parameter data to the PMEs via the BCI bus. These subroutines: 
i , cause e set of PMEs to do mathematical operations on distributed objects, 
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2. provkfe I/O data management art synchronization services for PME Array and System Bus 
Interaction*, 

3. provide startup program load, program overlay and program task management for PME*. 
o Development of date alocaSon support services tor host level program^ and 

s o Development of a programming support system including essemfcler. simuiafor, find H/W monitor and 
debug workstation. 

Baaed upon studies of mSrtary sensor fusion, opsmlzafcon, Image hatisforrnatlon, US Post Office optical 
character recognition and FBI fingerprint matching eppfcefons, we have conducted thet a parallel processor 
programmed wfth vector and array commands (Hk» BLA8 calls) would be effective. The underlying 
*> program*^ model must much the PME array characteristics feasible with today's technology. Specifi- 
cally: 

o PMEs can be Independent stored program processors, 
o The array can nave thousands of PMEs. and be suitable lor fine grained parallelism, 
o Intar-PME networks will have very high aggregate bandwidth and a small logical diameter*, 
n o But, by networtc connected microprocessor MIMD standards, each PME is memory Smtted. 

Prior rjrograrmtag on MIMD parallel processors has used task dispatching methodology. Such 
approaches lead to each PME needing access to an portion of a large program. This characteristic, In 
ccmWnaaon wfch the non-shared memory characteristic ot the HAW, *ould exhaust PME memory on any 
significant problem. We therefore target what we believe Is a new programming model, called 
a> "asynchronous SJMD* (A-SMD) type processing, Ws connection see U3SN 7©$,788, filed r>lovember *7. 
1891 of P. Kogge, which is hcorocrated herein 

A-SIMD prooramrning in our APAP design means thai a group of PMEs will be directed by commands 
broadcast to them as In 8IMD models. Hie broadcast command wir initiate execution of a MIMD function 
w&iin each PME. That execution can involve data dependent branching and depressing within PMEs, ml 
25 tO based synchronization wrth either other PME* or the BCI. 

Normally. PMEs wfll complete the processing and synchronize by reading the next command from the 
BCI. 

The A-SIMD approach includes both MaVO and 8MD operating modes. Since the approach imposes no 
actual time limits on (he command execution period, a PME operation that synchronizes on data transfers 

oo and executes indefinitely can be initialed. Such tunciions are very effective In data fiteffng, DSP, and 
6>6tofo operations. (They can be ended by either BCI interrupts or by commands over the serial control 
buses-} $IMD operation results from any A-SlMD control stream *>et does not Include MIMD Mode 
CommsfKte, Such a control stream can include eny of the PMEs native inamjcScns. These instructtona are 
routed direcfly to the instruction decode logic of the PME, Eliminating tie PME instruction fetch provides e 

35 higher performance mode for tasks that do not Invoke data dependent branching. 

This prograrnmtng model (supported by H/W features) extends to permitting fte array ot PMEs to be 
divided into Independent sections. A separate A-SIND command stream controls each socflort Our 
application studies show that programs of interest divide into separate phases («e* input, input buffering, 
several processlig steps, and output lormatSng. etc,), suitable for pipeline data processing. Fine-grained 

40 paraSelism results torn applying the n PMEs m a section to a program phase. Applying coarse-grained 
perifonaig to appficstkms often resufis in discovering small repetitive tasks sizable tor MIMD or memory 
bandwidth limited tastes suitable tor 8IMD processing. We program the MIMD poroons using conventional 
techniques and program the remaining phases as A-SftW sections, coded with vectorized commands, 
sequenced by the array controller. This makes the large controller memory the program store. Verytog the 

«s number of PMEs per section permhs balancing the workload. Varying the dspstohed ias* size permits 
balancing the BCI bus bandwidth to the control requirement 

The programming model also considers allocating data elements to PwiEs. The approach is to distribute 
daia dements evenly over PMEs, In early versions of this will be done by me programmer or by 
Wfe recognize that the IBM parallelizing compiler technologies apply to this problem and we expect to 

a> investigate their usage. However, the inter-PME banoSvidtN provided does tend to reduce the importantly o* 
this approach. This links data allocation and I/O mechanism performance. 

The H/W requires that the PME initiate data transfers out of Its memory, and It supports a controlled 
writo Into PME rramory wtihout PME program Irrvolvement input control occurs in to receiving PME by 
provttng an Input buffer address and a maximum length. When I/O to a PME results in buffer overflow. 

ss HrW evil interrupt the receiving PME The low level i/O functions mat wiH be developed for PMEs build on 
this servfce. We will support either movement ot raw data between adjacent PMEs or movement ot 
ackfcassed otafa between any PMEs. The last capability depends upon the circuit switched and store end 
forward rnextfartisms. The interpret address and forward operaaoo is important for performenoe. We have 



38 



EP 0 570 72$ A2 



optimized the HVr and &'W to support the operation. Using one word buff era results in on interrupt upon 
receipt or address header. Comparing uungel id Kith local Id permit© output path eetecttoru Transfer of the 
subsequent date words occurs in circuit switched mode. A slight variation on thte process using larger 
buffers results in a store and forward mechanism. 

5 Because of the high performance Inter-PME bandwidth, it is noi always necessary or desirable id place 
data efemeitse wrmm the PME Array carefufly. Consider shifang a vector daft element distributed across 
PMEs. Our arch lecture can send data without an address header, thus, providing tor very fast I/O. However, 
we have found* in many applications, that oottmteJng a data structure for movement In one direction, 
penaftzes data movement m an orthogonal dkecaon. The penalty fci such storatiorts approximates me 

io average cost of randomly routing data in the network. This leads to appicatbns where placing data 
sequentially or randomly {as oopoeed to arranging data* resute in shorter average process times. 

Many applications can be synchronized to take advantage of average access time. (For example* ROE 
relaxation processes acquire data from a neighborhood and thus, can average access over at least four I/O 
operations.) We believe that after considering the factors applicable to vector and array processes, like 

;s sctatengather or row'column arithmetic, many users will and brute force data allocation to be suitable for the 
appficatton. However* we know of examples that Illustrate application characteristics (Wee requ ir ed synchro" 
nizetion or biased utilization of shift directions'} that tend to force particular data attocetbn patterns. Tto 
characterteac requires that the toots and techniques developed support either manual tuning of the data 
placement* or simple and rum-optimum data allocation. (We wfll support the non-optimum data allocation 

20 strategy wim host level macros to provide near transparent port of vectorized host programs to the MPP. 
The H/W Monitor wortetaaon will permit the user to investigate the resuham performance.) 

FIGURE 19 shows the general SAW development and usage environment The Host Application 
Processor is optional in thai program execution can be controlled from efther the Host or we Monitor, 
Further, the Monitor will effectively replace Bte Array Controller is some situations. The environment wffl 

£S support program execution on real or simulated MPP hardware. The Monitor Is scenario driven so that the 
developer doing test and debug operations can create procedures to permit effective operation at any level 
of abstraction. 

FIGURE 20 illustrates tte levels of H/w supported within the MPP and the user interfaces to these 
Isvsts* 

30 Ws see two potential application programming techniques far the MPP. in the least programmer 
intensive approach, applications would be written in a vectorized high order language, if the user <*d not 
feel that the proWem warranted tuning data placement than he would use compile time servwee to allocate 
data to the PME Array. The appJIcaUon would use vector cafe like BLAG dial would be passed to the 
controller for interpretation and execution on the PME Array, Unique calls would be used to move data 

ss between host and PME Array. In summary, the user would not need to be aware of how the MPP organised 
or processed the date- Two opt^lsatkm techniques will be supported for this type application: 

1. AJtedng the data allocation by constructing me data allocation tabte will permit programs to force data 
placements. 

2. Generation of additional vector commands for execution by the array controller win permit tuned 
<o siAtancttons <fe, calling the Oeueeten Efiminaflon as a single operation, > 

We also see that the processor can be applied to specialized applications as in those referenced in the 
beginning of this section. In such cases, code tuned to the application would be used. However, even in 
such applications the degree of tuning wfll depend upon how important a particular task Is to fte application, 
B is in this situation that we see the need for taste individually suited to SIMD, MDtfD or A<8IMD modes, 
u These programs will use a combina^on oft 

1. Sequences of PME native instructions passed to an emulator function wthki the array controller. The 
emulator wiB broadcast the instruction and Us' parameters to the PME set The PMEs in this 6IMD mode 
wi3 pass the instruction to the decode function, simulating a memory fetch operation. 

2. Tight inner loops that can be I'D synchronized will use PME native ISA programs- Alter initiation from 
so a SIMD mode change, they would run continuously In Ml MD mods. (The option to return to SIMD mode 

via a 'RETURN* instruction exists.) 

3. More compfcafsd programs, as would be written in a vectorizing command set, would execute 
subroutines in the array controller that Invoked PME native functions- For example a simplified array 
controller program to do a BLAS 'SAXPr command on vectors loaded sequentially across PMEs wouM 

1 Gaussian Elimination with normal pivoting rewires shifting rows buf not columns. More than fci performance 
difference would resulr from arranging ^ ^ a euon tho j 0O j umn & wars on the fast shift direction. Even wirti 
that there is not an advanifiQe to Arranging rows in any particular relationship to En* &use*> 
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start sequences wlhln the PMEs mat 

£i Enable PMEb with required * elements vta comparison of PME Id wtih broadcast eiex* and 
* joW values, 

b. Compress to x values via a write to consecutive PME*, 
s c. Calculate the address of PMEs with y elements from broadcast data, 

d. Transmit ma rjompressea x data to me y PMEs, 

e. Do a single precision floating point operation In PMEs receiving x values to complete the operation. 
Hna%\ me SAXPY example fflusfcates one additional aspect of executing vectorteed apsScetion 

programs. The major steps execuie m itie API and could be programmed by either an optimizer or product 
jo developer. Normal/, the vectorued application would call rather than include this level o code. These steps 
would be written as C or Fortran code and wffl use memory mapped read or writes to control the PME array 
via the BCI bus* Such a program operates the PME array as a series of MIMD steps synchronized by 
returns to Ehe AR program, Minor steps such as the single precision floating point routines would be 
developed by *e Customer or Product Developer These operations wtl be coded using the native PME 
is ISA md wM be tuned to the machine characteristics. In general, Bits would be th« domain ol the Product 
Developer since coding, lest and optimization at this level require usage of the complete product 
development fool eat 

The APAP can have applications written in sequential Fortran. The peth fs quits different FIGURE 2% 
outlines a Fortran compiler which can be used. In the Erst step* II uses a portion of the existing paralieflihg 
20 compiler to develop program dependencies. The source plus these tables become an input to a process 
that uses a c+taracterizadon of the APAP MMP and the source to enhance parallelism. 

This MMP is a rwt-stered memory machine and as such allocates data between the PMEs for meal 
and global memory. The very fast data transfer time* and the high network bandwidth reduce the time 
affect ol data evocation, but it stil is addressed. Our approach treats part of memory as global and uses a 
3 SArV service function. It ie also possible to use the dependency information to perform the data allocation in 
a second afiemaave. The final step in converting me source to mufipte sequential programs is performed 
by the Level Partitioning step This partitioning step Is analogous to the t Fortran sup 3:ef work being 
conducted wtth darpa funding. The last process In the compilation Is generation of the executable code at 
all individual functional levels. For the PME this wilt be done by programming she code generator on an 
ao existing compiler system. The Host and API code compilers generate the code targeted Id those machines. 
The PA* can execute Ml/ID software from He own memory. In general, the multiple PMEs would not 
be executing totally different programs but rather would be executing the same small program in an 
asynchronous manner. Three basic types of SAiY can be considered athough the design approach does not 
imtt the APAP to just these approaches: 
ca 1. Specialized emulaton functions would make tie PME Array emulate me set at services provide by 
standard user ftbrertes nice UNPACK or VP$& to such an emulation package, the PME Array could be 
using its multiple set of devices to perform one of to operations required in a normal vector can This 
type of emulation, when attached to a vector processing unit, could utilize Che vector unit for some 
operator* whfle performing others toternaJry. 
* 2. The pereffefiern oS the PME Array could be exploited by operating a set of software that provided a 
new set of mathematical and service functions in the PMEs. This set of primitives would be the codes 
exploited by a customizing user to construct his appBcation. Tne prior example of performing sensor 
fusion on a APAP attached to a military platform would use such an approach. The customrzer would 
write routines to perform Kalman Filters, Track Optimum Assignment and Threat Assessment using trie 
« supplied set of function names. This application would be a series of API can statements, and each call 
would resufi in initiating the PME set to perform some basic operation like 'matrix mutupry* on data 
stored within the PME Array. 

3. In cases where no effective method, considering performance objectives, or application needs exists 
then custom &W could be developed and executed whhfn the PME. A specific example is k $orf a . Many 

so methods to sort data exist and the objective In afl cases Is to tune the process and the program id the 
machine architecture. The modified hypercuba ia we* suited to a Batcher Sort; however, mat son 
requires extensive calculations to determine parttcutar elements to compare versus very short compari- 
son cydes. The computer program in FIGURE 17 shows a simple example of a PME program iiOO to 
perform the Batcher Sort 1 000 with one element per PME. Each one of the program description would be 

as expanded to 3 to 6 PME machine level instructions, and al PMEs would ton execute the program in 
MM) mode. Program synchronization is managed via the I/O statements. The program extends to more 
data elements per PME end to very large parens! sons in a quite straight forward manner. 
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CC Storage Contents 

Date from the CC storage is used by the PME Array in one of two manners. Whan the PMEs are 
operating in 8IMD, a series of instructions can bo fetched by the CC and passed to the node BCI. thus, 
a reducing load on both tlie API and CS, Alternatively, functions that are not fireopenUV required, such as PME 
Fault Reconfiguration &W, PME Diagnostics, and perhaps conversion routines can be stored in the CC 
memory- Such functions can then be requested by operating PME MIMD programs or moved to the PMEs 
at the request ot API program directives. 

io Pgcjaglng of the frWay Modified Hyperpube 

Our packaging technique take advantage or toe eight PMEs packaged in a single chip and arranged m> 
a NHimansionai modified hypercube configuration. This chip level package or node of the array fs tha 
smallest bulking Hoc* in the APAP design, These nodes are *»en packaged in an 8X 8 array where the +- 

/s X end the *-Y makes rings wiWn the array or cluster and the end *-Z are brought out to the 
neighboring clusters. A grouping of clusters make up an array. This step significantly cuts down wire count 
for data and control for the array. The w end Z buses win connect to the adjacent dusters and faun W end 
2 rings to provide total connectivity around the completed array of various sfze. The massively parallel 
system *S be comprised of i>ase cluster building blocks to form the massive array of PMEs. The APAP 

so wa consist of an 8 X 8 array erf ctafers, each duster will have its own controller and ell the controls «« 
be synchronized by our Array Director. 

Many trade-offs of wtreebinty and topology have been considered, yet wan these considerations *e 
prefer the conBguraSon which we ifostrate with this connection- The concept disclosed has tha advantage of 
keeping the X and V dimensions within a cluster level of packaging, and distributing the YV and Z bus 

£0 connections to a» the neighboring dusters. 

After impbmarvtjng the techniques described, ft* product will be wireabte. and manufacturdble white 
maintaining &e inherent characteristics of the topofogy defined. 

fee concept used here is to ml*, match, and modify topologies at different packaging levels to obtain 
the desired results in farms ot wire count. For the me^od to deine the actual degree of modification of the 

so hypsroube, refer to (ho Rolfe modlflad hypercube patent application referenced above. For the purpose of 
this preferred embodiment we will desctfce two packaging Jewels to simpJfy our description. It can be 
expended. 

The first la the chip design or chip package illustrated by FIGURE 3 and FIGURE 11 . There are eight of 
the processing elements with their associated memoiy and oommuiuce?on logic encompassed into a single 

3s chip which Is defined as a node. The internal configuration is classified as a binary hypercube or a 2-degree 
hypercube where every PME is connected to two neighbors. See the PME-PME communication diagram in 
FIGURE a especially 500, 510. 520. 530. 540. 55a 550. 570. 

The second slap is that the nodes are configured as an ft X 8 array to make up a cluster. The foOy 
populated machine is buUt up of an array of 8 X & clusters to provide the mexbmum capacity of 32798 

40 PMEs. These 4095 nodes are connected in an 8 degree modified hypercube network where the commu- 
nication between nodes is programmable. This ability to program different routing paths adds fiexibifiy to 
transmit different length messages, In add-on to message length cfifferenoes, there are algorithm optimis- 
ations that can be addressed with these programmab)llty features. 

The packaging concept is intended lo significantly reduce [he off page xwre count tor each of the 

4$ dusters. This concept takes a duster which is defined as a 8 x 8 array of nodes 820. each node 825 having 
8 processing elements for a total of 512 PMEs, then to limit the X and Y ring vvNhin the cluster end, finally, 
to bring out the W and Z buses to afl dusters. Tha physical picture could be envisioned as a sphere 
configuration 800, 8i0 <rf 64 smaller spheres 830. See FIGURE 15 lor e future packaging picture which 
illustrates the fufl up packaging techniques limiting the X end Y rings 800 wHhin the duster and extending 

so out the W and Z buses to all clusters 810. The physical picture could be envisioned as a sphere 
configuration of 64 smaller spheres 830. 

The actual connection of a single node to the adjacent X and Y neighbors 975 exists vrtthin the same 
cluster. The wiring savings occurs when the z and w buses are extended to the adjacent neighboring 
dusters as Illustrated in FIGURE 15, Also illustrated in FIGURE ie is the set of the chips or nodes that can 

so be configured as a sparsely connected ^dimensional hypercube or torus d00\ 90S, 910, 915, Consider each 
of the 8 external porta to be labeled as +x. + Y« +z, +w, -X, -Y, -w 950, 975. Then, using rn chips, a 
ring can be constructed by connecting the +X to -X ports. Again m such rings can be interconnected into a 
ring of rings by intercconecfing the matching + Y to -Y ports. This level of structure wi be called a cluster. 
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It provides for 512 PMEs and will bo the building block lor several size systems. Two such connections 
(950, 075) are shown In the diagram illustrated In FIGURE 16. 

AppfcatkHTS far Owfcskte MPP- 

TUd desks*!* MPP in a workstason can be effectively applied in several application areas Inducing: 

1, SmeJ production tasks that depend upon compute intensive processes. The US Postal Service 
requires a processor that can accept a fax Image of a machine printed envelope and than find and read 
the zip code, The process Is needed at an regional sort facilities and ta en example of a very repeiHve 

3o but sfl compute intensive process. We have implemented APL language versions of a sample of the 
required programs. These models emulate the vector and array processes that will be used to do the 
work on tie MPP, Baaed upon this test we know that the task is an excellent match to the processing 
architecture. 

2, Taste in which an analysts as a result of prior output, or expected needs requests sequences of data 
is transformations. In an example taken from the Defense Mapping Agency, sateU&t Images are to be 

transformed and smoothed pteel by pixel Into some other coordinate system, to such a situation* the 
transformation parameters tor the image vary across localities as a resit* of ground elevation and slope. 
The analyst must fteretore add fixed control points and reprocess transformations. A similar need occurs 
in toe ulfiliallew of scientific simulation results '/men users require almost real time roteikm or per&peo 
20 trve changes. 

3, Program development for production versions of the MPP wis* use workstation size MPPs. Consider a 
tuning process that requires analysis ol processor versus network perfewwonce. Such a task ts machine 
and encyst Interactive. H can require hours when the machine is idle and ma analyst is v/dting. When 
performed on a supercomputer U would be very costly. However, providing an affordable workstation 

20 MPP with the same (but scaled} characteristics as the supercomputer MPP eliminates cotts and eases 
the feat and debug process by eliminating ins programmer foefticteftcies related to accessing remote 
processor*. 

FIGURE 22 is a drawing of the workstation accelerator. It uses the same size enclosure as the 
Rrsc/6000 model 530. Two swing out gates, each containing a fun cluster ere shown. The combined two 
a> clusters provide 5 GOP3 of fixed point performance and 530 MfbpS of processing power and about 100 
Mbyte/a of I/O bandwida to the array. The unit would be suitable to any of *e prior applications. With 
quantity production and hxtoefing a host RlfiCtfQGo, it would be price comparable with high performance 
workstations, not at the price of wxnparabte machines employing old technology. 

as Description of the AWAC3 Sensor Fusion 

The miliary environment provides a series of examples starring the need tor a hardened compute 
intensive processor. 

C*mrnunlcation in the targeted noisy environments tmptes the need for digitally encoded communica- 
nt tons, as is used m IONIA systems. The process of encoding the date for transmission end recovering 
information after receipt Is a compute intensive process. The task can be done will speciataed signal 
processing modules, but for situations oner* cemnrKinrcatton encoding represents bursts of activity. 
specJaHzsd modules are mostly idle. Using the MPP permits several such tasks to be allocated to a single 
module and saves weight, power, volume and cost 
<* Sensor data fusion presents a paracuteriy clear example of enhancing an extefrg platform with the 
compute power gained from the addition of MPP. On fte Air Force E3 AWAC9 there are more man four 
sensors on the platform, but there is currency no way to generate tracks resulting from the Integration of all 
available data. Firmer , *e existing generated tracks have quite poor quality due to sampling characterises. 
Therefore, there is motivation to use fusion to provide en effective higher sample rate, 
so We have studied ft is sensor fusion problem rh detail and can propose a verifiable and effective solution, 
but ftat solution would overwhelm the compute power available in an AWACS data processor, figure 23 
shows me traditional track fusion process- The process is faulty because each of *e Individual processes 
(ends to make some errors and the final merge tends to collect them Instead of eliminating them. The 
process is also characterized by high time latency in that merging does not complete until the stewest 
sb sensor completes. FIGURE 24 presents an improvement and the resulting compute intensive problem with 
the approach. Although we cannot solve a NP-Hard problem, we have developed a good method to 
approximate the solution. While me details of that appfceation are being described by the inventors 
elsewhere, as it can be employed on a variety of machines Gke an Intel Touchstone wife 512 i860 (60960) 
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processors, and IBM's Scientific Visualization System, it can be used as an application suitable tor the MMP 
using the APAP deeign described here with say 12BJN0 PMEs, substantially outperforming these other 
systems. Application experiments snow the approximation quality below the level of sensor noise and ae 
such the answer Is applicable to applications lite AWACS- FIGURE 25 shows the processing bop Involved 

6 in (he proposed Lagrangean Reduction n-dimensional Assignment algorlflim. The problem uses very 
controlled repetitions of me wen known 2-£m*n£onal assignment problem, the same algorHhm that 
classical sensor fusion processing uses. 

Suppose for example that the n-ciimonsionaJ alportfhm ^es to be applied to the seven sets of 
observations illustrated in FIGURE 24 end further, suppose that each pes* through a reduction process 

io required four iterations through a 2d Assignment process. Then the new 6d Assignment Problem would 
require 4000 iterations of the 2d Assignment Problem* The AWACS' workload is now about 90% of machine 
capacsy. Fusion perhaps requires 10% of the total effort but even that smart effort when seated up 4000 
times results in total ulifeation being 370 Ernes 0» capacity of an AWACS. Not enry does this workload 
overwhelm the existing processor, but it would be marginal In any new MIL environment suited, coarse* 

;s grained, parol!*! processing system currently erisang or anticipated In the next few years. If the algorithm 
required an average of 8 rather than 4 iterations per step, then it would overwhelm even the hypothesised 
systems. Conversely, the MPP solution can provide the compute power and can do so even at the 8 
iteration to vet 

s> Mechanical Packaging 

As illustrated In FIGURE 3. and other FIGURES, our preferred chip is configured In a quadualpeck form. 
As such it can be brfckwalfcd into into various 2 D and 3 O conffguralions in a package. One chip of efeht or 
more processor memory elements is a first level package module, the same as a single DRAM memory 

2* chip Is to a foundry which packages fte chip. However* It Is In a quadMpack form* allowing connectors to 
one another in four directions. Each connection is point to point. (One chip m its first level package is a 
module to (he foundry.) We are able to constate! PE arrays of sufficient magnitude to hit our performance 
goals due to this feature- The reality is ihef you can connect these chips across 3, 4 or even five feet pomt- 
o-po»x u. mutt-processor node to node, and still have proper control without the need of fiber opacs, 

so This has an advantage for the drtv&tecetve circuits that are required on the modules. One can achieve 
high performance and keep the power dissipation down because we do not have bus systems that daisy 
chain from module to module. We broadcast from node to node, but this need not be a high performance 
path. Most data operations can be conducted in a node, so data path requirements are reduced. Our 
broadesst path is essentially prlmanTy used as a controller routing tool. The data stream afacnee to and 

35 runs in. the ZWXV comrnunlcation paft system. 

Our power dissipation is 22. watts per node module for our commercial workstation. This allows us to 
use air cooled packaging. The power system requRernents for our system are also reasonable because of 
this fact Our power system illustrated multiplies the number of modules supported by about 25 *Btu? per 
module, and such a five volt power supply Is very cost effective. Those concerned with the amount of 

40 electricity consumed would be aatoniehed that 32 microcomputers could operate with less than the wattage 

consumed by a reading Eght 

Our thermal design b enhanced because of die packaging. We avoid hot spots due to high dissipating 
parts mixed with tow dtssJpfittng ones. This reflects directly on the cost of the assemblies. 

The cost of our system is very attractive compared Eo the approaches that put a superscalar processor 

45 on a card. Our performance fe^eJ per assembly per watt per connector pet pert type per dollar is excellent 
Furthermore, we do not need the same number of packaging levels that *e other technology doss. We 
do not need mc<lule/cardmackplane and cable. We can skip the card level if we want to. Ae illustrated In our 
workstation modules, nave skipped the card level wqh our bnekweifed approach. 

Furthermore, as we Ulustratsd m our layout each node housing which is brick waited in the workstation 

so modules* can as illustrated In FIGURE 3 comprise multiple repflcaled dies, even within ihe same chip 
housing, While normally we would place one die within an air cooled package, it is possible to place 8 die 
on a substrate using a multiple chip module approach. Thus, the envisioned watch with 32 or more 
processors, is possible, as well as many other applications. The packaging and power and flexibility make 
applications which are enotesa, A house could have its controflabte instruments at) watched, end coordi- 

5s naiad with a very small part Those many chips that ore spread around an automobile for engine watching, 
brake adjustment and so on could an have a monitor within a housing, in addition, one the same substrate 
wfth hybrid technology, one could mount a 388 microprocessor chip with full progiammable capability and 
memory (all In one chip) and use it as the arrey controller for Ihe substrate package. 
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Wis have shown roeny confirmations or systems, from control systems, FIGURE 3, to larger and larger 
eyetema. The atflity to package a chip wfth muaJpio proceeeor memory element of eight or more on a chip 
ai a dipt with pinouf* ft ting In « standard DRAM memory module, such as in a SIU module make possible 
countess additional applications ranging from controls to waH size video cftspteys which can have a 

s repetition rale, not a (he 18 or so frames that press the existing technology today, but at 00 frames, wtfli a 
processor assigned to monitor a pixel, or a node only a few pixels. Our bridwal) qu2drJar4*c* mates it easy 
to repfcaie the same part time over and oust again. Furthermore, the replicated processor Is realy memory 
w&h processor interchange. Part of the memory can be assigned to a specific monitoring task, and snorter 
part (vHth a size progjwnmatlcalry defined) can be a massive global memory, addressed potnt-to-polnt, vrfth 

to broadcast to all capability. 

Our basic wtfce&tlon, our supercomputer, our oont?oHer, our AWACS, ai are examples of packages 
that can employ our new technology. An array of memory, with inbuilt CPU chips and W, functions as a 
PME of massively parallel applications, and even more smiled applications. The llexibiity of packaging and 
programming makes Paginations expand and our technology allows one pan to be assigned to many Ideas 

is and Images. 

MSitary Avionics Appicaftons 

The coat advantage of constructing a Ml MRP Is particularly wel illustrated by the AWACS. U is a 20 
a? year old encioeiire that ha* grown empty space as new technology memory modules have replaced tin* 
original core memories- FIGURE 26 shows a MIL qudifiable two cluster system that would fit directly into 
the rack's empty space and would use the existing memory bus system for ^connection. 

Although fte AWACS example is very aoVantageou* due to the existence or empty space, in other 
systems ft b possible to create space- Replacing existing memory with a amen MPP or gateway to an an 
29 Isolated MPP is normally quite viable. In such cases, a quarter cluster and a adapter module would result In 
a 8 Megabyte memory P*us 649 MPs and use perhaps two slots. 

Supercstnputer Application 

a> A 64 cluster MPP la a 13.6 Stop wjpercompuler. it can be configured in a system described in FIGURE 

27. 

The system we describe allows node chips to be brick wetisd on cluster cards as illustrated m FIGURE 27 
(o buM up sy slams wltfi some significant coat and size edh/antages. There is no need tc Include extra chips 
such as a network switch in such a system because it would increase costs. 

as Our Interconnection system with "brfctfcwalled" chips allows systems to be built like massive ORAM 
memory Is packaged and «*■ have a deened bus adapter conforming to the rigid bus specifications* tor 
instance a mtaocftannsl bus adaptor. Each system wM have a smaller power supply system and cooling 
design than other systems based upon many rnootom preprocessor 

Unlfoe most supercomputers our current preferred APAP with floating point ernufafeon Is much fester In 

4* integer arithmetic (164- CI PS) then & is when doing floating poim arithmetic. As such, me processor would 
be most effective when used in applications mat are very character or integer intensive- We have 
considered three program challenges which in addition to the other applications dscussed hereto are 
neediul of solution. The applications which may be more important lhan some of the 'grand chaflenges" to 
day lo day ife include: 

<* 1, 9060 Vector Processors contain a very high performance floating point arfthrnette unit That unit, as do 
most vectorized floating point units; requires pipeline operations on dense vectors- Applications that 
make extenstoe use of non-regular sparse matrices (i.e. matrices described by btt maps rather than 
tfagoneis) waste the performance capacity of the floating point unit The MPP solves this problem by 
providing the storage for the data and using its compute power and network bandwidth, not to do the 

so calculation but rather to construct dense vectors, and to decompress dense results. The Vector 
Processing Umt is kept busy by a continual flow of operations on dense vectors being supplied to It by 
the MPP, By sizing the MPP so thai it can effectively compress and decompress at the same rate the 
Vector Fsofity processes* one could Keep both units fusy busy. 

2, Another host saached system we considered t* a solution to the FBI fingerprint matching problem, 
as Here, a machine with more than 64- clusters was considered. The problem was to match about 6000 
flngerprinte per hour against the entire database of fingerprint history* Using massive daSD and the full 
bandwidft of the MPP to host attachment, one can roll the complete data base across the incoming 
prints in about 20 minutes, Operating about 75% of the MPP In a 81 MD mode coarse matching 



44 



EP 0 670 729 A2 



operation, balances processing to required throughput rats. We estimate ftat 1 5% of the machine In A- 
SlMD processing mods would (hen complete the matching by doing the detailed verification of unknown 
print versus to print for C0$89 passing the oo»«e filter operation. The remaking portions of ihe machine 
were in MIMD mode and allocated to reserve capacity, work queue managemarrt and output formatting, 
3 3. Application of the MPP to database operations has been considered. Although the work is very 
pr seminary, h does seem to be a good match. Two aspects ot the MPP support this premise: 

a. The connection between a cluster Controller and the Applcation Processor Interface is a Micro- 
Channel. As such! It could be populated with DASD dedicated to fee duster end accessed directly 
from the cluster, A &4 cluster system with six 640 Mbyte hard drives attached per duster would 

w provide 246 Gbyte storage. Further, that entire database could be searched sequentially in 10 to 20 

seconds. 

b. Databases are generally not searched sequentially. Instead they use many levels of pointers. 
Indexing of databases can be done wfchin the cluster. Each bank of DASD would be supported by 2.5 
GIPS of processing power and 32 Mbyte of storage. That is sufficient for bosh searching and storing 

;s the indices. Since inrjbees are now frequently stored within the DASD, significant performance gains 

would occur. Using such an approach and dispersing DASD on SCSI Interfaces attached to tha cluster 
MicroChannel permits effectively unlimited size data bases. 
FIGURE 27 is an illustration ot the APAP when used to build the system into a supercomputer scaled 
MPP. The approach reverts to replicating unite, but here It is enclosures containing Id clusters that are 
so replicated. The partcuier advantage <rf this repneafon approach is that the system can be scaled to suit the 
user's needs. 

System Architectuie 

£S An advantage of the system architecture which is employed in the current preferred embodiment is the 
ISA system which will be understood by many who will form a pool for proommming the APAP. The PME 
ISA consists of (he following Data and Instruction Formate, illustrated to the Tables. 

Data Formats 

30 

The basic ^operand} size is the 16 bit word. In PME storage, operands em located on integral word 
boundaries. In addition to me word operand size, other operand sizes are available in multiples of 16 bits to 
support additional functions. 

Within any of the operand lengths, the bit positions of the operand are ronsecuHvety numbered from 
35 left to right starting wtth the number 0, Reference to high-order or most-signtficant bits always refer to the 
left-most bit positions. Reference to the lowwter or leesH^iftcem bits always refer to me ric/rn>mo$t bit 
positions. 

Inafruction Formats 

40 

The length of an instruction format may either be 16 bits or 52 bits* Si PME storage, instructions must 
be located on a 18 bit boundary. 

The following general Instruction formats are used. Normal ry, the first four bits of an Instruction define 
the operation code and are referred to as ihe OP bits, hi some cases, additional bits are required to extend 
4& the definition of the operation or to define unique conditions which apply to the instruction. These efts are 
referred to as 0PX bits. 
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Format Code 


Operation 


RR 


Reflisief to Regtefcr 


OA 


Direct Address 


R8 


Register Storage 


Rl 


Register Immediate 


SS 


Storage to Storage 


SPC 


Special 



Afl tormats have one field in common. This field and tts interpretation is: 

Bis 0-3 Operation Code - This ttekJ> sometfmes in conjunction with on operation code extension 
>5 field, defines the operation so be performed. 

Detailed figures of iha Individual formats along «Hh Interpretations of Iheir fiekfe are provided In the 
toSowieg subsections. For some irtsavcsona, two formats may bo oombbed to form variations on the 
Instruction, These primarily Involve the addressing mode for the instructor* As an ©xsmpte a storage to 
storage Instruction may have a form which Involves eSrect addressing or register addressing, 

20 

RR Fumiat 

The fteglster-Reglsier (RR) formal provides two general register addresses and Is it bus m length as 
shown. 



OP 

1 1 i 


RA 
_J ' 1 


0 0 0 0 

III 


Rfi 

i i i 











0 3 4 7 B 11 1 
1 2 5 



hi addition 10 an Operation Code held! the RR format contains: 

ests 4-7 Register Address. 1 - The RA field te used to specify which of the 16 general registers is 

to bo used as an operand and** destination. 
BSte 8*11 Zeros - Bit 3 being a zero defines the format to be a RR or DA format and bfe 
40 equal to zero define tteooerason to te a register to register operation <a special case of 

the Direct Address format). 
1% 12-15 Register Address 2 - The RB field is used to specify which of the 16 general registers la 
to be used as an operand. 



46 da Format 



The Direct Address <DA) format provide* one general register address and one direct storage settees 
as shown. 
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OP 

1 1 1 


RA 

r i i 


O 


DI ft ADDfl 
i i i i t i 


0 3 


4 7 


8 9 ! 



10 



In add-on to an Operation Cods field, fht DA format contains: 

BBS 4-7 Register Address 1 - The RA field is used lo specify which ol the 16 general registers ts 
to be used as an operand andtor destination. 

Zero - This bit being zero defines the operation to be & direct address opereiGn or a 
register to register operators 

Direct Storage Address - The Dbect Storage Address field Is used as en address info 
the level unique storage block or the common storage block. Bits of the direct 
address Beld must be non-zero to define the direct address form. 

RS Format 

The Register Storage <RS) format provides one psnerai register addressee end an indeed storage 
address. 



;s Bit 8 

Bits 9-16 

30 



OP 

r i i 


RA 

1 f 1 


1 


DEL 

. J L 


R8 
.1.1 i . 


9 3 


4 7 


6 9 1 
1 


1 1 

2 5 



In addition to an Operation Code field, ms R8 format contains: 

Bfte 4-7 Register' Address 1 - The RA Bold is used to specify which of fro 16 pnerg register* ts 

to be used as en operand end/or desana&cn. 
Bit e One -Thfeba being one defines the operation to be a register storage operation. 

Bite 9-11 Register Data - These bite are considered a signed value which is used to modify the 

contents of register specified by the RB field. 
Bits 12-15 Register Address 2-ThsRB tteid is used to specify which of the 16 general registers te 

to be used as an storage address for an operand 

Rl Format 



The Register-hnmediete (Rl) format provides one general register address end 16 bits of immediate 
dsia. The Rl format is 32 bite of tength as shown: 



so 
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8 


OP 


ftA 


1 


DEL 


9 0 9 0 


IMMEDIATE DATA . 




-1,1 1 


..111 




_J_J. 


1 * 1 


Illiiifii 



3 4 



JO 



7 8 9 11 

1 Z 



1 

5 



1 



In addition fto an Operation Coda few, the HI format contains; 



rat» 4.7 

KB 

rata 9-n 



12-19 



Register Address 1 -TheRAfteajia used to specify wNch of the 19 general roasters is 
to bfl used as an operand and&r destination. 

One - Tlw bit being one deQnes the operation to to a register storage operation., 
Register Date - These tits are considered a signed value which ks used to modify the 
contents of the program counter. Norma Wy.thb field would haw a value of one for the 
register Immediate format. 

Zero** . The field being 2*10 ie used to epec»y ihtf the updated profjrem ccumer, vAich 
points to the immediaiB dote field, is to be used as an storage address lor an operand, 
immediate Date * This field series as a 16 bit Immediate data operand for Register 
Immediate instmceons. 



25 $$ Format 

The Storage to Storage (SSj format provides two storage addresses, one expfcte and die second 
Implicit The Implied storage address Is contained In General Register 1. Register 1 Is modified during 
execution of the instruction- There are two forms of a SS instruction, a direct address form and a storage 
ao address form. 





OPX 






















(Direct 


OP 




0 


c 


e 


OIR ADDR 


Address 




0 


V 


R 






Form) 




p 


F 


Y 








■» I ■•- 


1 








•Mill 





0 3 4 7 8 9 1 

5 

4$ _ 





OPX 


























(Storage 


OP 




0 


c 


1 


DEL 


Re 


Address 




0 


V 


R 








Form) 


_1 1 1 


p 
1 


r 


Y 




1 1 


1 i 1 





0 3 4 7 8 9 11 1 
! 2 5 
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J3 



In Addition to on Operetta! Code told, the S3 fcrmai contains: 

Bits *7 Operation Extension Code - Tho OPX Bold, logother with mo Operation Code, defines 
(he operation to be performed. Bite 4-5 define the operation type sue* as ADD or 
SUBTRACT, Bhs 6-7 control the carry, overflow, and how Bra condition code will be eel. 
Bit 3 a 0 Ignores overflow, bit 8 = 1 allows overflow. Bft 7 => 0 ignore the cany start 
during the operaSon; bft 7 » 1 included She carry slat during the operation. 

Bit 8 Zero - Defines the form to be a direct address form. One - Defines ihe form to be a 

storage address form 

Bite 9-16 Direct Address (Direct Address Form) - The Direct Storage Address field Is used as 
an address into ihe level unique storage block or ihe common storage block. Bite d-1 1 of 
die direct address field must be non-zero te> define tho direct address form. 

Bits 9-11 Register Delta (Storage Address Form) - These bits are considered a signed value 
which is used to modify the contents of register specified by the RB Held. 

Bits 12.15 Register Address 2 (Storage Address Form) - The RB field is used to specify which 
of the 1 6 general registers is to be used es & storage address for an operand. 



SPC Format 1 



Hie Special (SPCf ) formal provides one general register storage operand address. 



20 



OP 

» 1 1 


OPX 

I 1, 1 


0 


LEH 

i t 


R6 

1 1 1 


e 3 


4 7 


8 1 
1 


1 1 

2 5 


OP 

2 i r 


OPX 

i i r 


1 


LEH 
i i 


RB 
.ill 


0 3 


4 7 


8 1 
1 


1 1 

2 5 



RR Form 



RS Form 



fri addition to an Operation Code field, the 9PC1 formal contains: 



30 



Bits 4-7 
Bit 8 

Bite 9-11 



Bits 12.15 



SPC Format 2 



OP Extension - The OPX iteid is used to extend the operation code. 

Zero or One - This ba being zero defines tie operation to be a register operation. This 

ba being one defines ffie operation Jo be a register storage operation. 

Operation Length - These bits are considered en unsigned value which is used to 

specify the lengtfi of the operand in 16 b'ri words. A value of zero corresponds to a length 

of one, and a value of B 1 11 V corresponds to & length of eight 

Register Address 2 - TTie RB field is used to specify which of me 13 General registers is 

to be used as a storage adctass for the operand. 



The Special (SPC2) formal provides one general register storage operand address. 
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OP 

I. i I 


RA 

I LI.. 


OPX 
i 1 1 


RB 

1 1 1 


0 3 


4 7 


8 I 


1 1 



30 



1 2 



(n addition to «n Operation Code field, the SPC2 formal contains: 

BBS 4-7 Register Address 1 - The RA fteU ts used Id specify which of the IS gsnerd rentiers is 

to be used as an operand endtor destination. 
8*38-11 OP Extension * Hie OPX field re used to extend the operaBon code. 
Bite 12-15 Register Address 2 - The RB field is used to specify which of the 13 general registers la 

to be used as a storage address to* the operand. 

THE INSTRUCTION LIST OF THE ISA INCLUDES THE FOLLOWING: 



Table 1 (Pags 1 of 3). Fk«M>olnt Amhmelic l^ructiorrt 
NAME 

AOD DIRECT 



mm 

dda 



TYPME 

OA 
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Table 1 f*ag* 2 ft* Fixed-Poinl Aritanuti* IftttreStons 





NAME 


MNE- 


TYPf 






MOWC 






AOD FROM STORAGE 


a 


RS 




(WITH DELTA) 


awd 


RS 




AOD IMMEDIATE 


ai 


Rl 


10 


(WITH DELTA) 


aiwd 


Rl 




AOO REGISTER 


ar 


RR 




COMPARE DIRECT ADDRESS 


cda 


DA 


iS 


COMPARE IMMEDIATE 


ci 


Rl 




(WITH DELTA) 


ciwd 


Rl 




COMPARE FROM STORAGE 


c 


RS 


20 


(WITH DELTA) 


cwd 


RS 




COMPARE REGISTER 




RR 




COPY 


CDV 


RS 


2$ 


{WITH DELTA) 


cpywd 


RS 




COPY WITH BOTH IMMEDIATE 


covbl 


Rl 




{WITH DELTA) 


cpyhiwd 


Rl 


30 


COPY IMMEDIATE 


CDV I 


Ri 


(WITH DELTA) 


cpyiwd 


Rl 




COPY DIRECT 


cpyda 


DA 




COPY DIRECT IMMEDIATE 


cpydai 


DA 


38 


INCREMENT 


inc 


RS 




(WITH DELTA) 


Incwd 


RS 




LOAD DIRECT 


Ida 


DA 




LOAD FROM STORAGE 


l 


RS 




(WITH DELTA) 


iwd 


RS 




LOAD IMMEDIATE 


ii 


Rl 




{WITH DELTA) 


fiwd 


Rl 




LOAD REGISTER 


Ir 


RR 




MULTIPLY SIGNED 


mpy 


SPC 




MULTIPLY SIGNED EXTENDED 


mpyx 


SPC 




MULTIPLY SIGNED EXTENDED IMMEDIATE 


nnpyxi 


SPC 
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T«b<« 1 (Pjoe 3 of 3 1, Kned-Poifli AsiihflMMK Fnsffvctions 

NAME MNE- TYPME 

s Motyc 

MULTIPLY SIGNED IMMEDIATE tnpyi SPC 

MULTIPLY UNSIGNED mpyu SPC 

MULTIPLY UNSIGNED EXTENDED rnpyux SPC 

MULTIPLY UNSIGNED EXTENDED IMMEDIATE mpyuxi SPC 

MULTIPLY UNSIGNED IMMEDIATE mpyui SPC 

STORE DIRECT stda DA 

" STORE st RS 

(WITH DELTA) atwd RS 

STORE IMMEDIATE sti Rl 

20 (WITH DELTA) stlwel Rl 

SUBTRACT DIRECT 8 da OA 

SUBTRACT FROM STORAGE s RS 

9 (WITH DELTA) swcl RS 

SUBTRACT IMMEDIATE &E Rl 

(WITH DELTA) a£*d R| 

SUBTRACT REGISTER sr RR 

SWAP AND EXCLUSIVE OR WITH STORAGE swapx RR 

as Table 7 (Page I of 3X Storage to Storage instructions 

NAME MNE- TYPME 

M9N IC 

ADD STORAGE TO STORAGE sa SS 

fWfTH DELTA) sawd SS 

ADD STORAGE TO STORAGE DIRECT aada SS 

AOD STORAGE TO STORAGE FINAL sat SS 

* (WITH OELTA) safwd SS 

ADD STORAGE TO STORAGE FINAL DIRECT safda SS 

ADD STORAGE TO STORAGE INTERMEDIATE sai SS 

50 (WITH DELTA) sahwd SS 
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Table 2 (Page 3 of 3). Storage to Storage Iralnr.tianc 

NAME MHEi TYPM E 

MOMIC 

ADO STORAGE TO STORAGE INTERMEDIATE 





DIRECT 


saida 


SS 




ADD STORAGE TO STORAGE LOGICAL 


sal 


SS 




(WITH DELTA) 


salwd 


SS 




ADD STORAGE TO STORAGE LOGICAL DIRECT 


salda 


SS 




COMPARE STORAGE TO STORAGE 




SS 




(WITH DELTA) 




SS 




COMPARE STORAGE TO STORAGE DIRECT 


scdd 






COMPARE STORAGE TO STORAGE FINAL 


scf 


SS 


<w 


fWlTH DELTAS 


erfufH 






COMPARE STORAGE TO STORAGE FINAL DIBFCT 


OWIUGi 


SS 




COMPARE STORAGE TO STORAGE INTERMEDIATE 


scl 


SS 




(WITH DELTA) 
COMPARE STORAGE TO STORAGE INTERMEDIATE 


sciwd 






DIRECT 


scida 


SS 




COMPARE STORAGE TO STORAGE LOGICAL 


scl 


SS 


3D 


(WITH DELTA) 
COMPARE STORAGE TO STORAGE LOGICAL 


sclwd 


SS 




DIRECT 


sdda 


SS 


38 


MOVE STORAGE TO STORAGE 


smov 


SS 




(WITH DELTA) 


smovwd 


SS 




MOVE STORAGE TO STORAGE DIRECT 


smovda 


SS 




SUBTRACT STORAGE TO STORAGE 


93 


SS 




(WITH DELTA) 


33Wd 


SS 




SUBTRACT STORAGE TO STORAGE DIRECT 


ssda 


SS 




SUBTRACT STORAGE TO STORAGE PINAL 


5$1 


SS 




<WITH DELTA) 


ssfwd 


SS 




SUBTRACT STORAGE TO STORAGE FINAL DIRECT 


ssfda 


SS 




SUBTRACT STORAGE TO STORAGE INTERMEDIATE 


ssi 


SS 




(WITH DELTA) 


ssiwd 


SS 
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Tablo 2 (Page 3 of J). Stof sge to Storage Insiruelians 
NAME 

SUBTRACT STORAGE TO STORAGE INTERMEDIATE 
DIRECT 

SUBTRACT STORAGE TO STORAGE LOGICAL 
10 (WITH DELTA) 

SUBTRACT STORAGE TO STORAGE LOGICAL 
DIRECT 



Tab!* 3 

20 



29 



Logical nstruotlons 


NAME 


MNEMONIC 


TYPME 


AND DIRECT ADDRESS 


nda 


DA 


AND FROM STORAGE 


a 


RS 


<WITH DELTA) 


mvd 


RS 


AND IMMEDIATE 


nl 


Rl 


(WITH DELTA) 


nfwd 


Ri 


AND REGISTER 


ro- 


RR 


OR DIRECT ADDRESS 


od* 


DA 


OR FROM STORAGE 


0 


RS 


iWITH DELTA) 


owd 


RS 


OR IMMEDIATE 


d 


Rl 


(WITH DELTA) 


otwd 


Rl 


OR REGISTER 


or 


RR 


XOR DIRECT ADDRESS 


xda 


DA 


XOR FROM STORAGE 


X 


RS 


(WITH DELTA) 




RS 


XOR IMMEDIATE 


xt 


Rl 


(WITH DELTA) 


xtaJ 


Rl 


XOR REGISTER 


XT 


RR 



MMfc TYPME 

mm 

saida SS 

ssl SS 

aslwd SS 

sslda SS 
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Tgwe 4 JPaea 1 of Zj, Stun Instmctiont 

MNE» TYPME 

9 mm 

SCALE BINARY spc 

SCALE BINARY IMMEDIATE scate! SPC 

SCALE BINARY REGISTER scaler SPC 

SCALE HEXADECIMAL 5ca!eh spc 

SCALE HEXADECIMAL IMMEDIATE scaled SPC 



;s 


^w-MLfc ntAAUcvljvlAL REGISTER 


scalehr 


SPC 




»rllrT left arithmetic BINARY 


sla 


SPC 




SHIFT LEFT ARITHMETIC BINARY IMMEDIATE 


slai 


SPC 


20 


SHIFT LEFT ARITHMETIC BINARY REGISTER 


star 


SPC 




SHIFT LEFT ARITHMETIC HEXADECIMAL 


slab 


SPC 




SHIFT i PPT APITUMCTIr uevAne^iRiii niiip^iivp 

^niri u.ri Aftf mivifcVIG: HEXADECIMAL IMMEDIATE 


siahj 


SPC 




SHIFT LEFT ARITHMETIC HEXADECIMAL REGISTER 


slahr 


SPC 




SHIFT LEFT LOGICAL BINARY 


sll 


SFC 




SHIFT LEFT LOGICAL BINARY IMMEDIATE 


sill 


SPC 


so 


SHIFT LEFT LOGICAL BINARY REGISTER 


slJr 


SPC 




SHIFT LEFT LOGICAL HEXADECIMAL 


sllh 


SPC 




SHIFT LEFT LOGICAL HEXADECIMAL IMMEDIATE 


slihi 


SPC 


35 


SHIFT LEFT LOGICAL HEXADECIMAL REGISTER 


sJMir 


SPC 




SHIFT RIGHT ARITHMETIC BINARY 


sra 


SPC 




SHIFT RIGHT ARITHMETIC BINARY IMMEDIATE 


srai 


SPC 




SHIFT RIGHT ARITHMETIC BINARY REGISTER 


srar 


SPC 




SHIFT RJGHT ARITHMETIC HEXADECIMAL 


sr*h 


SPC 




SHIFT RIGHT ARITHMETIC HEXADECIMAL IMME- 


sraftl 


SPC 




DIATE 






<* 


SHIFT RIGHT ARITHMETIC HEXADECIMAL REGISTER 




SPC 




SHIFT RIGHT LOGICAL 8INARY 


srt 


SPC 


SO 


SHIFT RIGHT LOGICAL BINARY IMMEDIATE 


srli 


SPC 
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Tefcla 4 {Pag* 2 of 2). ShUi rm<rgc<i$n* 

NAME MM: HEMI 

MOM1C 

SHIFT RIGHT LOGICAL BINARY REGISTER srlr SPC 

SHIFT RIGHT LOGICAL HEXADECIMAL Sfih SPC 

SHIFT RIGHT LOGICAL HEXADECIMAL IMMEDIATE srlhi SPC 

90 SHIFT RIGHT LOGICAL HEXADECIMAL REGISTER srlhr SPC 

?5 Table 5 (Pag« f of 2). Branch Instructions 

NAME MN&> TVPME 

^ BRANCH b RS 

(WITH DELTA) bwd RS 

BRANCH DIRECT bda DA 

BRANCH IMMEDIATE bi Rl 

as 

(WITH DELTA) biwd Rl 

BRANCH REGISTER br RS 

BRANCH ANO LINK bal RS 

BRANCH AND LINK DIRECT balda DA 

BRANCH AND LINK IMMEDIATE ball Rl 

(WITH DELTA) baU>*d Rl 

35 BRANCH AND LINK REGISTER balr RS 

BRANCH BACKWARD bb RS 

(WITH DELTA) bbwd RS 

" BRANCH BACKWARD DIRECT bbda DA 

BRANCH BACKWARD IMMEDIATE bbl Rl 

(WITH DELTA) bbiwd Rl 

<* BRANCH BACKWARD REGISTER bbr RS 

BRANCH FORWARD bf RS 

(WITH DELTA) biwd RS 

so BRANCH FORWARD DIRECT bfda DA 
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JO 



TaMe 5 {Page 2 of 2). Branch foMructioos 
NAME 

BRANCH FORWARD IMMEDIATE 

(WITH OELTA) 
BRANCH FORWARD REGISTER 
BRANCH ON CONDITION 

(WITH DELTA) 
BRANCH ON CONOITION DIRECT 
BRANCH ON CONDITION IMMEDIATE 

(WITH DELTA) 
BRANCH ON CONDITION REGISTER 
BRANCH RELATIVE 

(WITH DELTA) 
NULL OPMERATION 



MNE- 
MONIC 
bfi 

bfiwd 

bfr 

be 

bewd 
beda 
bci 

bdwd 

bcr 

brel 

brelwd 

nooi> 



HEM! 

Rl 

Rl 

RS 

RS 

RS 

RS 

Rl 

Rl 

RS 

Rl 

RS 

RR 







Tawed 




30 


Stasus Switching instruct cm 




NAME 


MNEMOttfC 


TVPME 




RETURN 


ret 


SPC 


35 




Tabid 7 





(npuf Output Instructions 


NAME 


MNEMONIC 


TYPME 


IN 


IN 


SPC 


OUT 


OUT 


SPC 


INTERNAL DIGR/DIOW 


INTR 


SPC 



SOME SUMMARY FEATURES 



The apap Machine fai PoropocHvo 



We have described ft accordance with our Ertven&on could be thought oT in its more detaBed aspects to 
be positioned In the technology somewhere between the CM-i and N-cube. Like out APAP, the CM-1 uses 
a poffii design for the processing element and combines processing elements with memory on the basic 
ch\p. The CM-1 % however uses a 1 bit wide serial processor ^hils [he APAP series will use a ie bit wide 
processor The CM series of machines started *Kh 4K bite of memory per processor and lies ©town to a or 
16K bte versus tfie 32K by ie bits we have provided for the first APAP chip. The CM-1 and to fcOow-ons 
are strictly SIMD machines while die CM-5 Is a hybrid. Instead or this, our APAP wttl effectively use MIMD 
operattng modes in conjunction with QMD mode$ when ussfui While our parallel 16 bit wide PMEs might 
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be viewed as a step toward me N-cub«, Svs step is not warranted. Tha APAP does not sap&rate memory 
and routing from tire processing element as does the N-cube kind of machine. Also, the APAP provides for 
up to 32K 16 bit PMEs while tha N-cube only provide for 4K 32 bit processors, 

Evan with the superficial similarities presented above, tJw APAP concept completely differs from tha 
5 CM end N-cube series by: 

1. The modified hypeicubs incorporated in our APAP is a new invention providing a significant packaging 
and addressing advantage whan compared with hypercube topologies. Per Instance, consider that tha 
32K PME APAP in its first preferred embodiment has a network diameter of 19 logical steps and with 
transparency, this can be reduced to en efteceve 16 logical step* Further, by oorr^sfeon, it a pure 

jo hypercube were used, end if all PMEs were sending data through an 8 step path, then on average 2 of 
every 0 PMEs would be active while the remainder *ouH be delayed due to blockage. 

Atematively, consider the 64K hypercube that would be needed it CM-1 was a pure hypercube. In 
that case, each PME would require ports to 16 other PMEs, and data could be routed between the two 
fei&est separated PMEs to 15 logical steps, if en PMEs tied to transfer an average distance of 7 steps, 

is the 2 of every 7 would be active. However, CM-1 dees not utilise e 16d hypercube. & interconnects the 
18 nodes on a chip with a NEWS networic then it provides one router function within the chip. The 4086 
routers are connected into a I2d hypercube Wsh no collisions the hybrid still has a logical diameter of 
15, but skice 16 PMEs could be contending for fw fak He effective diameter is much greater. That is, 
wtlh 8 step moves only 2 of 16 PMEs could be active, which means that 8 complete cycles rather than A 

20 cycles are needed to complete en date moves. 

The N-cube sctuaRy utilizes a pure hypercube, but currently only provides ior a 4096 PMEs and 
thus, utilizes a i2d {1 3d for 8102 PMEs} hypercube. For the N-eube te grow te ieK processors, at which 
point ft would have the same processing data width as the APAP, It would have to add four 3mes aa 
much hardware and would have to increase the connection parte to each PME router by 25%, Although 

9 no hard date exists to support this conclusion, & would appear that the N-cube architecture runs out of 
connecter pins prior to reaching a 16K PME machine. 

2. The completely integrated and distributed nature of major tasks within the APAP machine is a decided 
advantage. As was noted for the CM and N-cube aeries of machines, each had to have separate unite fci 
message routing as wefl as separate units tor floating point coprocessors. The APAP system combines 

30 the Integer, floating point processing, message routing and I/O control Into Ste single point design PME. 
"Hurt design then replicated 8 ernes on a chip, and the chip is alee replicated 4K times to produce the 
array. This provides several advantages: 

a. Using one chip means maximum size production nine and minimal system factor cods* 

b. Regular architecture produces the most effective programming systems. 

33 c Almost al chip pins can be dedicated to the generic problem of intorprocessor communication, 

maximizing the Inter-chip I/O bandwidth which tends to be a important RmUlng factor to mpp designs. 

3. The APAP has the unique design abiffly to take advantage of chip technology gams and capital 
investment in custom chip designs. 

Consider the question of floating point performance. It is enfclpated that APAP PME performance on 
40 DAXPY will be about 125 cycles per flop. In contrast, me '387 C^roceseor would be about 14 cycles 
while the WeHeo Coprocessor in the CM-1 would be about 8 circles. However, in the CM case there Is 
only one floating point unit for every is PMEs whfie to the N-cube case there ts probably one '387 type 
chip associated w&h each of the *3€& processors. Our APAP has 16 times as many PMEs and therefore 
can almost completely make up for the single unit performance delta. 
4S More significantly, the 8 apap pues within a chip are constructed from 50K gates currently 

available in the technology. As memory macros shrink and tie number of gates available to the bgrc 
increases, Spending that increase on enhanced floating point normalization should permit APAP Heating 
point performance to far exceed the other units. Alternatively, effort could be spent to generate a Pfvt= or 
PME subsection design using custom design approaches, enhancing total performance while to no way 
so affecting any S/W developed for the machine. 

We believe our design for our APAP has tfwacrerisScs poised to take advantage or the future 
process technology growth. In contrast me nearest similar machines CM-x and N-cube which employ a 
system like that described in FIGURE 1 seem well poised to take advantage of yesterday's technology 
which we leel is dead ended, 
ss An advantage of the APAP concept is the abiSty to use OASO associated with groups of PMEs. This 
apap capability, as wes as the ability to connect displays and auxii**y storage, is a by-product of picking 
MC bus structures as me interface to the external MO ports of tie PME Array. Thus, APAP systems will be 
configurable and can include card mounted hard drives selected from one of the set of unite that are 
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compatible wth PS/2 or RISC/8000 units. Further, that capability should be available without designing any 
additional pari number modules aimough it does require utilizing mors repScatema of the backpanel and 
base enclosure than does me arap. 

This brief perspective is not intended to be limiting, but rather is intended to cause those skilled in the 

6 art to review the foregoing description and examine how the many Inventions we have described which may 
be used to move me art ot massively parallel systems ahead to a One when programing ta no longer a 
significant problem and the costs of such systems are much lower. Our kind of system can be made 
available* not only to Vie few, but to many as it could be made at a cost vrithin the reach of commercial 
department level procurements. 

70 While we have described our preferred embodiments of our invention, it win be understood thai those 
skilled in the art both now and in the future, upon the understanding of these discussions win make various 
improvements and enhancements thereto which fall within the scope of the claims which fottow. Those 
claims should be construed lo maintain the proper protection for the invention first cfectosed, 

;s Claims 

1. A computer system comprising, e plurality of mufH-processor memory elements, each having commu- 
nication paths, processor and memory, and wherein a programmable router is provided tor routing data 
and control Information from one multt-crocesser memory element to another multi-processor memory 

as element and between nodes of the compter system. 

2. A computer system according to claim 1 wherein each mum-processor memory dement <PME) has 2n 
processors, and communication para which minimize delays due to chip crossings. 

£0 3b A computer system according to claim 1 wherein each mule'-processor memory element (PME) has a 
processor, memory and routers wftfiin a single chip and internal and external communication paths 
which minimize delays due to crop crossings* each processor memory element having means tor fixed 
and floating point processing, rotting and \fO controL 

M 4 A computer system accorcSng to claim 1 firmer comprising within a processor memory element 

a native instruction set means for providing an expendable multiply function, a programmer router for 
routing information alternatively teftfrighi, 

NEWS matrix* fffiWS/up-dowfv hypercube. and wherein said program triable router is employe a 
hardwired distributed router provided by each processing memory element. 

38 

5- A computer computer system according to claim 1 organized as a massively parallel machine wift 
nodes interconnected as a n dimensional network cluster with parallel communication paths beiween 
processor memory dements along said Internal and external communication paths providing a process- 
ing array, and wherein processing memory elements of an array have a transparent mode uelzed when 
routing oats between processing memory elements within a chip set ot process^ memory elements 
for permitting reduction of the effective network diameter of a network of nodes, 

6. A computer system according to claim i wherein a node of a processor array has mutfcte emgfb 
processor elements made up of 32K Id-bit words with a 'Ifl-bil processor for a network node of eight 

& processors with then associated memory with their tolly districted I/O routers and signal I/O ports. 

7. A computer system according lo daim f wherein a node of a processor array has muJUpte single 
processor elements made up of 32K id-bit words with a i&bit processor tor a network node of eight 
processors with their associated memory with their firfty distributed K> routers and signal I/O ports 

30 combined as groups groups of node clusters organized as a 2d modified hypercube, 

& A computer system according to claim i wherein a node of a processor array has rnuMpte single 
processor elements made up of 02K 16-btf vwda with a ia-bll processor tor a network node of eight 
processors with their associated memory with their fu)ry distributes K? routers and signal VO ports 
so combined as groups groups of nods clusters organized as a 2d modified hypercube, with up to 64 
creators integrated in a network of node clusters to torn ere integrated to form a id mooltled 
hypercube of up to 32,768 processing memory elements. 
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9. A computer sywem according to claim 1 tvheretn a nod* processing memory element has Internal data 
flows using high epeod hard registers to feed distributed ALU and and WD router registers and logic Tor 

fil Operations. 

IQl A computer system according to data 1 wherein a node processing memory element in has an VO 
port for oft crop byte wide oonrrotricetion. and has topi* pom) that are connected aucri ftet data may 
be routed from Input to memory, or from an Input address register to an output register via a direct 
Derate! data path. 

It A computer system according to claim 1 wherein a node has multiple processor memory elements and 
is connected to other nodes In a dusfc* network wtt data routing distributed between hardware and 
software, wth software controlling most of the task sequencing {unction. 

12. A computer system according to claim l wherein a node has multiple processor memory elements and 
la connected to other nodes a duster network wfth data routing distributed between hardware and 
software, wtth hardware provided for performing Inner loop transfers and minimizing overhead on the 
outer loops of the node. 

13L A computer system according to claim 1 wherein a node has multiple processor memory elements end 
is connected to other node* in a cluster network with vo programs at dedicated interrupt level* «or 
managing the network. 

14* A computer system according to cUum i wherein a node has multiple processor memory element* end 
is connected to other nodes in a olueter network whit VO programs oi dedicated interrupt levels for 
managing the network* each processor memory element having interrupt registers end dedicating tour 
interrupt levels to receiving data from four neighbors, a buffer provided at each level by loading 
registers ai the level, and having in and return instruction pairs using a buffer address and transfer 
count to enable the processor memory element to accept words from an Input bus and to s»ie them to 
the burfer, 

IS. A mgtfrprocessor memory system, comprising: a plurality of muff-processor memory elements, each 
multiprocessor memory element <PME) having 2n processors, memory and routers within a single chip 
and internal and external csommunteatton pattis which mlnimiie delays due to chip crossings. 
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